)]}'
{
  "log": [
    {
      "commit": "0e1101aa9349ec5c4dd3abb8fd4d16fb05a25ae7",
      "tree": "129c27d2ab4586cd5de25438cb2a4ac84bd52408",
      "parents": [
        "5d3aa9a3fb32fd4a35a3d8f1b6d55e1b8c7e804d"
      ],
      "author": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Fri Jun 12 17:01:35 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@foxmail.com",
        "time": "Fri Jun 12 17:01:35 2026 +0800"
      },
      "message": "[SPARK-57388][INFRA] Pin downstream actions/checkout to a single resolved SHA in maven_test.yml and python_hosted_runner_test.yml\n\n### What changes were proposed in this pull request?\n\nIn `.github/workflows/maven_test.yml` and `.github/workflows/python_hosted_runner_test.yml`, add a step to the precompile job (`precompile-maven` / `precompile`) that captures `git rev-parse HEAD` right after the apache/spark checkout, expose it as a `head_sha` job output, and switch the downstream `build` job\u0027s `actions/checkout` from `ref: ${{ inputs.branch }}` to `ref: ${{ needs.\u003cprecompile-job\u003e.outputs.head_sha || inputs.branch }}`.\n\nThis is the same pinning that SPARK-56866 applied to `build_and_test.yml`, with one adaptation. Unlike the mandatory `precondition` job there (downstream jobs are skipped when it fails), the precompile jobs here are best-effort: they run with `continue-on-error: true` and the `build` job proceeds with `if: (!cancelled())` even when precompile fails. The `|| inputs.branch` fallback keeps that degraded path intact: if the precompile job dies before resolving the SHA, the `build` job resolves `inputs.branch` itself, exactly as today. Without the fallback, an empty `ref:` would make `actions/checkout` fall back to the default branch or the (master) event SHA, which is wrong for `branch-4.x` runs.\n\n### Why are the changes needed?\n\nThese two reusable workflows have the same cross-job checkout race that SPARK-56866 fixed in `build_and_test.yml`: the precompile job builds Spark and uploads the compiled output as an artifact, then each `build` matrix entry independently re-resolves `ref: ${{ inputs.branch }}` at the moment its runner picks it up. The matrix only starts after the full precompile build finishes (tens of minutes), so the drift window is structurally long. 
If the branch advances in between, a `build` entry checks out a newer commit than what was precompiled and runs tests against stale compile artifacts extracted on top of it, producing the same class of spurious mixed-commit failures described in SPARK-56866.\n\nWorkflows covered by this change:\n- `maven_test.yml` is called by 11 workflows: `build_maven.yml`, `build_maven_java21.yml`, `build_maven_java21_arm.yml`, `build_maven_java21_macos26.yml`, `build_maven_java25.yml`, and `build_branch4{0,1,2}_maven*.yml`.\n- `python_hosted_runner_test.yml` is called by `build_python_3.12_arm.yml` and `build_python_3.12_macos26.yml`.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. CI infrastructure only.\n\n### How was this patch tested?\n\nYAML syntax validated locally. These workflows are not exercised by PR CI (their callers are schedule-triggered daily jobs), so the change takes effect on their next scheduled runs; it mirrors the `build_and_test.yml` change that has been running since 2026-05-19.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (claude-fable-5)\n\nCloses #56450 from zhengruifeng/ci-pin-checkout-sha-other-workflows-dev4.\n\nAuthored-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@foxmail.com\u003e\n"
    },
    {
      "commit": "5d3aa9a3fb32fd4a35a3d8f1b6d55e1b8c7e804d",
      "tree": "b7df83712d043045716fd6c2dd6501763bd3f16a",
      "parents": [
        "3d4c8b0ce633cc95a3a7f35c28bf746d2698e9c5"
      ],
      "author": {
        "name": "BRIJ RAJ KISHORE",
        "email": "22271048+brijrajk@users.noreply.github.com",
        "time": "Fri Jun 12 07:53:15 2026 +0200"
      },
      "committer": {
        "name": "Max Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Fri Jun 12 07:53:15 2026 +0200"
      },
      "message": "[SPARK-34679][SQL][DOC] Add inferTimestamp option to JSON data source options table\n\n### What changes were proposed in this pull request?\n\nAdded the missing `inferTimestamp` option to the Data Source Options table in\n`docs/sql-data-sources-json.md`. The option was absent from the table despite\nbeing available since Spark 3.0.1 and referenced in the migration guide.\n\nThe new row is placed after `timestampNTZFormat` (the options it relates to)\nwith the description matching the Scaladoc in `JSONOptions.scala`:\n\n\u003e Allows inferring of `TimestampType` and `TimestampNTZType` from strings that\n\u003e match the timestamp patterns defined by the `timestampFormat` and\n\u003e `timestampNTZFormat` options respectively.\n\n### Why are the changes needed?\n\nUsers have no way to discover `inferTimestamp` from the official options\nreference. The option is already mentioned in the migration guide\n(`sql-migration-guide.md`) but was never added to the options table.\n\nFixes https://issues.apache.org/jira/browse/SPARK-34679\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. Documentation only.\n\n### How was this patch tested?\n\nNo code change — documentation only. Verified the option is defined in\n`JSONOptions.scala` with `inferTimestamp` as the key name and `false` as the\ndefault.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude (Anthropic)\n\nCloses #56248 from brijrajk/SPARK-34679-infertimestamp-json-doc.\n\nAuthored-by: BRIJ RAJ KISHORE \u003c22271048+brijrajk@users.noreply.github.com\u003e\nSigned-off-by: Max Gekk \u003cmax.gekk@gmail.com\u003e\n"
    },
    {
      "commit": "3d4c8b0ce633cc95a3a7f35c28bf746d2698e9c5",
      "tree": "c82fa690d2a4c944f18610a08e0f556a10879759",
      "parents": [
        "913f105ed50aeafc95e31699acfb9a45e056e86b"
      ],
      "author": {
        "name": "Chao Sun",
        "email": "sunchao@apache.org",
        "time": "Fri Jun 12 11:28:53 2026 +0800"
      },
      "committer": {
        "name": "Cheng Pan",
        "email": "chengpan@apache.org",
        "time": "Fri Jun 12 11:28:53 2026 +0800"
      },
      "message": "[SPARK-56903][SQL][FOLLOWUP] Fix null join key shuffle config version\n\n## Why are the changes needed?\n\n`spark.sql.shuffle.spreadNullJoinKeys.enabled` was introduced on the Spark 4.3 development line, but its configuration metadata incorrectly says that it was introduced in Spark 4.1.0.\n\nThis follow-up addresses the review comment on apache/spark#55927 so the documented version matches the first Spark release that contains the configuration.\n\n## What changes were proposed in this PR?\n\nUpdate the configuration\u0027s `version` metadata from `4.1.0` to `4.3.0`.\n\n## How was this PR tested?\n\n- Ran `git diff --check`.\n- No runtime tests were added because this change only corrects configuration metadata and does not affect behavior.\n\nCloses #56462 from sunchao/dev/chao/codex/spark-56903-config-version.\n\nAuthored-by: Chao Sun \u003csunchao@apache.org\u003e\nSigned-off-by: Cheng Pan \u003cchengpan@apache.org\u003e\n"
    },
    {
      "commit": "913f105ed50aeafc95e31699acfb9a45e056e86b",
      "tree": "bf15496fa59cae7f8cdc1acfb143646dafd722d1",
      "parents": [
        "f64739db30aa31391e50dfedd1d49faf4ad3d99b"
      ],
      "author": {
        "name": "Cheng Pan",
        "email": "pan3793@gmail.com",
        "time": "Fri Jun 12 11:25:38 2026 +0800"
      },
      "committer": {
        "name": "Cheng Pan",
        "email": "chengpan@apache.org",
        "time": "Fri Jun 12 11:25:38 2026 +0800"
      },
      "message": "[SPARK-57387][YARN] Make executor JVM options `-XX:OnOutOfMemoryError` configurable on YARN\n\n### What changes were proposed in this pull request?\n\nCurrently, Spark always adds `-XX:OnOutOfMemoryError\u003d\u0027kill %p\u0027` to executor JVM options when running on YARN. This PR makes the behavior configurable.\n\n### Why are the changes needed?\n\nThere are many places where Spark\u0027s internal memory management handles `OutOfMemoryError` and intends to recover from it. With `-XX:OnOutOfMemoryError\u003d\u0027kill %p\u0027` set, that recovery code never takes effect, and the executor is always killed when it hits an `OutOfMemoryError`.\n\nNote: when `-XX:OnOutOfMemoryError\u003d\u0027kill %p\u0027` is configured, `kill %p` runs regardless of whether the code catches the `OutOfMemoryError` thrown by the JVM.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nUTs are added.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Opus 4.8.\n\nCloses #56448 from pan3793/SPARK-57387.\n\nAuthored-by: Cheng Pan \u003cpan3793@gmail.com\u003e\nSigned-off-by: Cheng Pan \u003cchengpan@apache.org\u003e\n"
    },
    {
      "commit": "f64739db30aa31391e50dfedd1d49faf4ad3d99b",
      "tree": "14f815a6c226e690fa6a462159bde44d9a3de120",
      "parents": [
        "334952601082732e809ec12cf11319a60dd32ea4"
      ],
      "author": {
        "name": "Yicong Huang",
        "email": "17627829+Yicong-Huang@users.noreply.github.com",
        "time": "Fri Jun 12 00:06:19 2026 +0000"
      },
      "committer": {
        "name": "Yicong-Huang",
        "email": "17627829+Yicong-Huang@users.noreply.github.com",
        "time": "Fri Jun 12 00:06:19 2026 +0000"
      },
      "message": "[SPARK-57381][PYTHON] Refactor SQL_WINDOW_AGG_PANDAS_UDF\n\n### What changes were proposed in this pull request?\n\nRefactor `SQL_WINDOW_AGG_PANDAS_UDF` to be self-contained in `read_udfs()`: the bounded/unbounded window logic moves from the wrapper functions and the old mapper into a single execution block that uses `ArrowStreamGroupSerializer` as pure I/O, converting between Arrow and pandas via `ArrowBatchTransformer.to_pandas` and `PandasToArrowConversion.convert`. The wrappers `wrap_window_agg_pandas_udf`, `wrap_unbounded_window_agg_pandas_udf` and `wrap_bounded_window_agg_pandas_udf` are removed, and `ArrowStreamAggPandasUDFSerializer` is now only used by `SQL_GROUPED_AGG_PANDAS_ITER_UDF`.\n\nSame pattern as #55153 (`SQL_WINDOW_AGG_ARROW_UDF`).\n\n### Why are the changes needed?\n\nPart of [SPARK-55388](https://issues.apache.org/jira/browse/SPARK-55388).\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nExisting tests. No behavior change.\n\nASV micro-benchmark `WindowAggPandasUDFTimeBench` (`-a repeat\u003d3`), master (before) vs this branch (after), representative run shown with diff computed on its central values. 
Results are consistent across 3 runs on each side (12-21% improvement on the 3-run averages); `WindowAggPandasUDFPeakmemBench` shows no change.\n\n```text\nscenario         udf                before        after     diff\nfew_groups_sm    sum              59.5±3ms     50.3±6ms   -15.5%\nfew_groups_sm    mean_multi       66.8±6ms     50.8±2ms   -24.0%\nfew_groups_lg    sum               122±2ms     92.1±2ms   -24.5%\nfew_groups_lg    mean_multi        133±2ms   98.1±0.3ms   -26.2%\nmany_groups_sm   sum            2.05±0.02s   1.64±0.01s   -20.0%\nmany_groups_sm   mean_multi      2.35±0.2s   1.85±0.02s   -21.3%\nmany_groups_lg   sum               573±6ms      479±3ms   -16.4%\nmany_groups_lg   mean_multi       642±20ms    529±0.6ms   -17.6%\nwide_cols        sum              566±20ms     436±20ms   -23.0%\nwide_cols        mean_multi       553±40ms     452±20ms   -18.3%\n```\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #56442 from Yicong-Huang/refactor/window-agg-pandas-udf.\n\nAuthored-by: Yicong Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\nSigned-off-by: Yicong-Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\n"
    },
    {
      "commit": "334952601082732e809ec12cf11319a60dd32ea4",
      "tree": "6a7e20480eb7c6b7994bd5806cc3a565f5efc03a",
      "parents": [
        "b88952cabfff54973957cdc21d3a548acf951ac1"
      ],
      "author": {
        "name": "Szehon Ho",
        "email": "szehon.apache@gmail.com",
        "time": "Thu Jun 11 16:50:58 2026 -0700"
      },
      "committer": {
        "name": "Szehon Ho",
        "email": "szehon.apache@gmail.com",
        "time": "Thu Jun 11 16:50:58 2026 -0700"
      },
      "message": "[SPARK-57360][SQL] Block temporary variables in generated column expressions\n\n### What changes were proposed in this pull request?\nThis PR adds a validation that rejects references to session (temporary) variables in a generated column\u0027s generation expression, for both `CREATE TABLE` and `REPLACE TABLE`. The check is added to the central `GeneratedColumnExpression.validate` and throws `UNSUPPORTED_EXPRESSION_GENERATED_COLUMN` with a clear reason.\n\n### Why are the changes needed?\nThis is a regression from #54126 ([SPARK-55347][SQL] Pass Generated Column as Expression to DSV2). Previously, the generation expression was analyzed by an independent (standalone) analyzer that did not resolve session variables, so a generated column referencing a temporary variable was not accepted. After #54126 moved the analysis inline into the main analyzer (similar to how constraint expressions are analyzed), the analyzer\u0027s last-resort variable resolution now resolves the reference to a `VariableReference` (which is deterministic and foldable) and constant-folds it, so it silently slips past the existing generated column validations (subquery / self-reference / other generated column / non-deterministic / type / collation checks).\n\nPersisting a session-scoped, mutable value into a generation expression is ill-defined: the value recomputed when reading or rewriting the column can differ across sessions or over time. The SQL standard restricts generation expressions to be deterministic and to reference only other columns of the same row, and other engines explicitly disallow variables:\n- **MySQL**: \"Variables (system variables, user-defined variables, and stored program local variables) are not permitted.\"\n- **HSQLDB** (close to the standard): the expression \"must reference only other, non-generated, columns of the table in the same row ... 
must be deterministic and must not access SQL-data.\"\n\nThis restores the previous protection and aligns Spark with the standard.\n\n### Does this PR introduce _any_ user-facing change?\nYes. A `CREATE TABLE` / `REPLACE TABLE` whose generated column references a temporary variable now fails analysis with `UNSUPPORTED_EXPRESSION_GENERATED_COLUMN` (reason: \"generation expression cannot reference temporary variables\") instead of being accepted.\n\nExample:\n```sql\nDECLARE my_var INT DEFAULT 1;\nCREATE TABLE t(a INT, b INT GENERATED ALWAYS AS (a + my_var)) USING foo;\n-- now fails with UNSUPPORTED_EXPRESSION_GENERATED_COLUMN\n```\n\n### How was this patch tested?\nAdded unit tests in `DataSourceV2SQLSuite` covering both `CREATE TABLE` and `REPLACE TABLE`.\n\n### Was this patch authored or co-authored using generative AI tooling?\nGenerated-by: Cursor (Claude Opus 4.8)\n\nCloses #56421 from szehon-ho/block_temp_variable_from_generated_column.\n\nAuthored-by: Szehon Ho \u003cszehon.apache@gmail.com\u003e\nSigned-off-by: Szehon Ho \u003cszehon.apache@gmail.com\u003e\n"
    },
    {
      "commit": "b88952cabfff54973957cdc21d3a548acf951ac1",
      "tree": "58aca97e9e4a4f88e2f64b6a81605ef41fd240eb",
      "parents": [
        "bcdcd47c6688474e36b69e65cb1778838036fa4f"
      ],
      "author": {
        "name": "Huaxin Gao",
        "email": "huaxin.gao11@gmail.com",
        "time": "Thu Jun 11 15:59:10 2026 -0700"
      },
      "committer": {
        "name": "huaxin-gao_snow",
        "email": "huaxin.gao@snowflake.com",
        "time": "Thu Jun 11 15:59:10 2026 -0700"
      },
      "message": "[SPARK-57393] Build: PySpark and SparkR source distributions are missing LICENSE and NOTICE files\n\n### What changes were proposed in this pull request?\n\nMake the PySpark and SparkR source distributions include top-level LICENSE and NOTICE files:\n\n- Add include LICENSE / include NOTICE to python/MANIFEST.in.\n- In dev/make-distribution.sh, copy LICENSE/NOTICE into python/ and R/pkg/ before building the sdists (with an EXIT trap to clean them up).\n- The classic pyspark sdist bundles the assembly jars, so it ships the binary license variants (LICENSE-binary/NOTICE-binary plus the full licenses-binary set), mirroring the binary distribution. The pyspark_connect, pyspark_client, and SparkR artifacts bundle no jars, so they ship the plain source LICENSE/NOTICE. packaging/classic/setup.py falls back to licenses/ when licenses-binary/ is absent (RELEASE-mode builds).\n- During the SparkR build, + file LICENSE is added to DESCRIPTION temporarily (restored after the build) so R CMD check --as-cran does not warn that the bundled LICENSE is not mentioned. The committed DESCRIPTION is unchanged, so SparkR CI is unaffected. The \"Non-standard file/directory found at top level: \u0027NOTICE\u0027\" NOTE cannot be silenced this way and is expected in release-build logs.\n- Add tar-listing regression guards after the sdist and SparkR builds that fail the release build if LICENSE or NOTICE ever goes missing again, instead of relying on an RC vote to catch it.\n\n### Why are the changes needed?\n\nASF release policy requires every distributed artifact to include LICENSE and NOTICE. 
The pyspark, pyspark_connect, pyspark_client, and SparkR source tarballs currently don\u0027t, which was raised as a -1 during the Spark 4.2.0 RC1 vote.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo\n\n### How was this patch tested?\n\nBuilt the source distributions locally and confirmed each contains LICENSE and NOTICE at the package root\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nYes, drafted with assistance from Cursor.\n\nCloses #56453 from huaxingao/fix_license.\n\nAuthored-by: Huaxin Gao \u003chuaxin.gao11@gmail.com\u003e\nSigned-off-by: huaxin-gao_snow \u003chuaxin.gao@snowflake.com\u003e\n"
    },
    {
      "commit": "bcdcd47c6688474e36b69e65cb1778838036fa4f",
      "tree": "20684e7017299efbf03a239acb1157ec389dc460",
      "parents": [
        "3fce4cfa34213adb30633ee942ceffcc9d58b922"
      ],
      "author": {
        "name": "Nikolina Vraneš",
        "email": "nikolina.vranes@databricks.com",
        "time": "Thu Jun 11 15:56:39 2026 -0700"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Thu Jun 11 15:56:39 2026 -0700"
      },
      "message": "[SPARK-57133][SQL] Add BIN BY relation operator parsing and resolution\n\n### What changes were proposed in this pull request?\n\nThis re-lands #56247, which was reverted in #56413. The original change broke the scheduled `pyspark-connect-old-client` CI job (`build_python_connect40.yml`, master server vs branch-4.0 client), because the 7 new non-reserved keywords widened the `sql_keywords().show()` doctest output and the branch-4.0 client still expected the old column width. That doctest has since been made robust in #56406 (now merged: it checks `.columns` instead of `show()`).\n\nThe description follows:\n\n---\n\nThis is the first PR in a planned series implementing the `BIN BY` relation operator (SPARK-57133). It adds the parser, analyzer, and error classes. Physical execution is intentionally stubbed and lands in a follow-up PR.\n\n`BIN BY` is a relation-level operator (same grammar position as `PIVOT` / `UNPIVOT`) that aligns range-typed rows to fixed-width bin boundaries: it splits any row whose `[range_start, range_end)` crosses a boundary and proportionally redistributes selected FLOAT/DOUBLE values across the resulting sub-ranges. DISTRIBUTE UNIFORM columns must be FLOAT or DOUBLE; other types are rejected with `BIN_BY_DISTRIBUTE_TYPE_MISMATCH`, so callers cast to DOUBLE in an upstream projection. 
The target use case is telemetry and observability data, where each row carries its own measurement window (OpenTelemetry, Prometheus exports).\n\nSyntax:\n```sql\nSELECT * FROM relation BIN BY (\n  RANGE rangeStartCol TO rangeEndCol\n  BIN WIDTH widthExpr\n  [ALIGN TO originExpr]\n  DISTRIBUTE UNIFORM (distributeCol [, distributeCol ...])\n  [BIN_START AS aliasName] [BIN_END AS aliasName] [BIN_DISTRIBUTE_RATIO AS aliasName]\n) [AS resultAlias];\n```\n\nWhat this PR adds:\n- Grammar (`SqlBaseLexer.g4`, `SqlBaseParser.g4`): the `binByClause` rule and 7 new non-reserved keywords (`BIN`, `WIDTH`, `ALIGN`, `UNIFORM`, `BIN_START`, `BIN_END`, `BIN_DISTRIBUTE_RATIO`), wired into `relationExtension` and the pipe `operatorPipeRightSide`, with an optional trailing table alias.\n- Logical plans (`basicLogicalOperators.scala`): `UnresolvedBinBy` (parser output) and the resolved `BinBy`, plus the `BinByOutputAliases` helper. This follows the two-class `Unpivot` -\u003e `UnpivotTransformer` precedent.\n- AST builder (`AstBuilder.scala`): `withBinBy`, which wraps the node in a `SubqueryAlias` when a trailing alias is present.\n- Analyzer rule (`ResolveBinBy.scala`, wired into `Analyzer.scala`): resolves column references against the child output, validates types and foldability, folds the `BIN WIDTH` and `ALIGN TO` expressions to micros (each guarded so a foldable-but-throwing expression, e.g. an ANSI CAST failure, surfaces as a clean `BIN_BY_*` error rather than `INTERNAL_ERROR`), fills the default origin (session-zone-anchored for `TIMESTAMP`, wall-clock epoch for `TIMESTAMP_NTZ`), captures the session time zone, and builds the output schema. 
Registered in `RuleIdCollection`; the `BIN_BY` / `UNRESOLVED_BIN_BY` tree patterns are added in `TreePatterns`.\n- Self-join support (`DeduplicateRelations.scala`): `BinBy` is an attribute-producing node, so it is registered in both dedup phases (`renewDuplicatedRelations` and `collectConflictPlans`) to renew the appended attributes\u0027 `ExprId`s for self-joins over a shared `BinBy` subtree, matching the `Generate` / `AttachDistributedSequence` producer pattern.\n- Error classes (`error-conditions.json`, `QueryCompilationErrors.scala`): the 11 `BIN_BY_*` conditions, with analysis-time builders for the 10 raised during resolution, including `BIN_BY_INVALID_ALIGN_TO` for an `ALIGN TO` expression that fails to fold (the runtime `BIN_BY_INVALID_RANGE` is defined here and raised in the execution PR).\n- Execution stub (`SparkStrategies.scala`): the lowering throws `UNSUPPORTED_FEATURE.BIN_BY` until the execution PR lands.\n\nThe output is the input columns plus three appended columns: `bin_start` and `bin_end` (matching the range column type) and `bin_distribute_ratio` (DOUBLE, the fraction of the original range that fell into the bin). All three are renameable.\n\n### Why are the changes needed?\n\nTelemetry and observability sources emit rows that each carry their own `[start, end)` measurement window. Re-bucketing such data onto a fixed grid today requires verbose SQL with manual boundary arithmetic, row explosion, and proportional splitting. `BIN BY` expresses this as a single relation operator.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. The operator parses and resolves, but physical execution is intentionally stubbed in this PR (the strategy throws an `UNSUPPORTED_FEATURE` error), so `BIN BY` is not usable end to end yet; execution arrives in a follow-up PR. 
The 7 new keywords are non-reserved, so existing queries that use them as identifiers continue to parse unchanged.\n\n### How was this patch tested?\n\nNew unit tests, all passing:\n- `PlanParserSuite`: `BIN BY` parsing (minimal and maximal clauses, qualified column references, output renames, trailing alias, and the pipe form), parse-error cases, and confirmation that the new keywords remain usable as identifiers.\n- `ResolveBinBySuite`: resolution against the child output, session-zone capture, default-origin arithmetic (UTC, non-UTC, NTZ), output schema and renames, multipart disambiguation across a join, DISTRIBUTE UNIFORM accepting only FLOAT/DOUBLE (other types rejected), and every analysis-time error class.\n- `BinBySuite`: end-to-end check that a `BIN BY` query analyzes successfully but its physical execution surfaces `UNSUPPORTED_FEATURE.BIN_BY` (the interim stub) rather than an internal error.\n\nAll passing via `build/sbt \u0027catalyst/testOnly *ResolveBinBySuite *PlanParserSuite\u0027`.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code\n\nCloses #56426 from vranes/bin-by-parser.\n\nAuthored-by: Nikolina Vraneš \u003cnikolina.vranes@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "3fce4cfa34213adb30633ee942ceffcc9d58b922",
      "tree": "f3a6361c3734efc377e02c45dab927432531473f",
      "parents": [
        "a423d06f3d9f14b4dd112ae255ed9317b9ab5ab7"
      ],
      "author": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Thu Jun 11 15:44:42 2026 -0700"
      },
      "committer": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Thu Jun 11 15:44:42 2026 -0700"
      },
      "message": "[SPARK-57073][SS][PYTHON][TEST] Catch AnalysisException for test_parity_listener\n\n### What changes were proposed in this pull request?\n\nCatch `AnalysisException` for the test and ignore it because it\u0027s acceptable (table not created yet).\n\n### Why are the changes needed?\n\nThe test has been flaky on the Build / Python-only (master, Python 3.12, MacOS26) scheduled workflow:\n\n2026-05-23 — https://github.com/apache/spark/actions/runs/26346300968/job/77556662680\n2026-05-25 — https://github.com/apache/spark/actions/runs/26423905857/job/77783724134\nBoth failed with AnalysisException: TABLE_OR_VIEW_NOT_FOUND on listener_terminated_events: the onQueryTerminated callback fires asynchronously after q.stop() returns and writes the table via saveAsTable, but on slower macOS runners the read races the write.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nLocal test passed. This fails on MacOS more frequently so we need to observe it in scheduled tests.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #56309 from gaogaotiantian/test-parity-listener.\n\nAuthored-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\nSigned-off-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\n"
    },
    {
      "commit": "a423d06f3d9f14b4dd112ae255ed9317b9ab5ab7",
      "tree": "fa94fcaf681268f1d9c73b8884b46e5b8f16f063",
      "parents": [
        "a1922b5e2b1507206a4ecbcf65bb62870a3934b3"
      ],
      "author": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Thu Jun 11 15:29:02 2026 -0700"
      },
      "committer": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Thu Jun 11 15:29:02 2026 -0700"
      },
      "message": "[SPARK-57327][INFRA] Move scheduled CIs for 4.1 to branch-4.1\n\n### What changes were proposed in this pull request?\n\nAdd a unified entry for all scheduled CIs for branch-4.1 (`branch41_scheduler.yml`). It uses `gh workflow` to trigger the self-contained build workflows on `branch-4.1`, and removes the per-build `build_branch41_*.yml` files from `master`.\n\nThis follows the same approach as SPARK-56990 (#56046), which did this for `branch-4.x`. SPARK-57267 (#56330) already laid the ground on `branch-4.1` by making the build workflows self-contained and dispatchable, so this PR only needs to change the `master` scheduled tasks.\n\nThe scheduler triggers the following targets on `branch-4.1` (also exposed via `workflow_dispatch`): `build_java17`, `build_java21`, `build_maven`, `build_maven_java21`, `build_non_ansi`, `build_python_3.11`, `build_python_3.14`, `build_python_pypy3.10`. Note this differs from the `branch-4.x` set: there is no `java25` build on `branch-4.1`, and `branch-4.1` additionally has a `pypy3.10` build. The cron times are spread out and chosen to avoid the hours already used by the `branch-4.x` scheduler.\n\n`README.md` is updated so the branch-4.1 badges point at the self-contained workflows filtered by `?branch\u003dbranch-4.1`.\n\n### Why are the changes needed?\n\nThis is part of decoupling our CIs. All `branch-4.1` related CIs should only rely on files on `branch-4.1`, with the exception of this new scheduler file which is needed on `master` to trigger scheduled tasks (scheduled workflows only fire from the default branch).\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. 
CI only.\n\n### How was this patch tested?\n\nThese workflows can be triggered manually via `workflow_dispatch` once merged.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Claude Opus 4.8)\n\nCloses #56379 from gaogaotiantian/decouple-branch41-scheduler.\n\nAuthored-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\nSigned-off-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\n"
    },
    {
      "commit": "a1922b5e2b1507206a4ecbcf65bb62870a3934b3",
      "tree": "9fc95f8a289ded39acf8338c2707d90af53e13d3",
      "parents": [
        "9357bc9ae05e3848995df2f2f68bb2fa0759e826"
      ],
      "author": {
        "name": "Mihailo Aleksic",
        "email": "mihailo.aleksic@databricks.com",
        "time": "Thu Jun 11 13:51:00 2026 -0700"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Thu Jun 11 13:51:00 2026 -0700"
      },
      "message": "[SPARK-57369][SQL] Move main EXECUTE IMMEDIATE resolution logic to common code\n\n### What changes were proposed in this pull request?\nThis PR moves the main EXECUTE IMMEDIATE resolution logic to common code to ease the single-pass implementation.\n\n### Why are the changes needed?\nTo ease the single-pass implementation.\n\n### Does this PR introduce _any_ user-facing change?\nNo, it\u0027s just a refactor.\n\n### How was this patch tested?\nExisting tests.\n\n### Was this patch authored or co-authored using generative AI tooling?\nYes.\n\nCloses #56429 from mihailoale-db/execimmrefactor.\n\nAuthored-by: Mihailo Aleksic \u003cmihailo.aleksic@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "9357bc9ae05e3848995df2f2f68bb2fa0759e826",
      "tree": "cbfbbb90751eabeb9223ee4fa4cc3191e588e77e",
      "parents": [
        "8cced6f5bd834f027aadb5b6684566879d89831e"
      ],
      "author": {
        "name": "Anurag Kumar Dwivedi",
        "email": "anuragd916@gmail.com",
        "time": "Thu Jun 11 12:58:56 2026 -0700"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Thu Jun 11 12:58:56 2026 -0700"
      },
      "message": "[SPARK-57295][SQL] Make database location validation consistent for w…\n\n**PR Description**\nWhat changes were proposed in this pull request?\nThis PR updates database location validation to reject whitespace-only location values during namespace/database creation and location alteration.\n\nCurrently Spark rejects only empty string locations:\n`CREATE DATABASE db LOCATION \u0027\u0027`\nwith:\n`INVALID_EMPTY_LOCATION`\n\nHowever, whitespace-only values such as:\n\n```\nCREATE DATABASE db LOCATION \u0027 \u0027\nCREATE DATABASE db LOCATION \u0027\\t\u0027\nCREATE DATABASE db LOCATION \u0027\\n\u0027\n```\nare not rejected by Spark validation.\n\nAs a result:\n- In non-HMS catalog paths, databases may be created successfully using whitespace-only locations.\n- In HMS-backed catalog paths, the request reaches Hive Metastore and fails later with metastore-specific exceptions.\n\nThis PR trims the location value before performing the empty-location validation so that whitespace-only values are treated consistently as empty locations.\n\nThe validation is updated in:\n```\nResolveSessionCatalog\nDataSourceV2Strategy\n```\nAdditional regression tests are added for:\n```\n\u0027 \u0027\n\u0027\\t\u0027\n\u0027\\n\u0027\n```\nfor both:\n```\nnamespace/database creation\nnamespace/database location alteration\n```\n\nBy treating whitespace-only values as empty locations, Spark provides:\n- consistent validation behavior\n- consistent error reporting\n- earlier failure during analysis\n- reduced dependence on catalog-specific validation\n\n**Does this PR introduce any user-facing change?**\nYes.\n\nThe following statements will now fail consistently with:\n`INVALID_EMPTY_LOCATION`\n\n**Examples:**\n```\nCREATE DATABASE db LOCATION \u0027 \u0027\nCREATE DATABASE db LOCATION \u0027\\t\u0027\nCREATE DATABASE db LOCATION \u0027\\n\u0027\n```\n\nJira - https://issues.apache.org/jira/browse/SPARK-57295\n\nCloses #56356 from 
AnuragKDwivedi/SPARK-57295-db-location-validation.\n\nAuthored-by: Anurag Kumar Dwivedi \u003canuragd916@gmail.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "8cced6f5bd834f027aadb5b6684566879d89831e",
      "tree": "0bf10d0fea3c639669608f9c94988778c294735d",
      "parents": [
        "be299a1efd4e30ddbee15e874bb6cfd2ff95f179"
      ],
      "author": {
        "name": "Yicong Huang",
        "email": "17627829+Yicong-Huang@users.noreply.github.com",
        "time": "Thu Jun 11 18:23:19 2026 +0000"
      },
      "committer": {
        "name": "Yicong-Huang",
        "email": "17627829+Yicong-Huang@users.noreply.github.com",
        "time": "Thu Jun 11 18:23:19 2026 +0000"
      },
      "message": "[SPARK-57020][PYTHON][TEST] Add ASV microbenchmark for SQL_TRANSFORM_WITH_STATE_PANDAS_UDF\n\n### What changes were proposed in this pull request?\n\nAdd an ASV micro-benchmark for `SQL_TRANSFORM_WITH_STATE_PANDAS_UDF` to `bench_eval_type.py`.\n\nA stub TCP listener (`_StubStateServer`) satisfies `StatefulProcessorApiClient`\u0027s socket connect; the benchmark UDFs never call any state API so no protocol exchange beyond connect is needed.\n\nScenarios cover few/many groups, small/large group sizes, wide columns, mixed value types (string/binary/boolean) and a nested struct column. UDFs: `identity_udf`, `sort_udf`, `count_udf`. The input pdfs are value-only (the grouping key is projected out before the UDF, mirroring `worker.py`\u0027s `values_gen`), so `identity_udf`/`sort_udf` pass values through and `count_udf` reconstructs the grouping key from the `key` arg to keep it in the output, matching the common transformWithState output shape.\n\n### Why are the changes needed?\n\nPart of [SPARK-55724](https://issues.apache.org/jira/browse/SPARK-55724). 
Establishes a performance baseline before refactoring `SQL_TRANSFORM_WITH_STATE_PANDAS_UDF`.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo\n\n### How was this patch tested?\n\n`COLUMNS\u003d120 ./python/asv run --python\u003dsame --bench \"TransformWithStatePandas\" -a \"repeat\u003d(3,5,5.0)\"` (one of two stable runs):\n\n`TransformWithStatePandasUDFTimeBench`:\n\n```text\n\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d  \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d  \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d  \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\nscenario        identity_udf  sort_udf    count_udf\n\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d  \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d  \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d  \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\nfew_groups_sm   742±20ms      768±9ms     724±2ms\nfew_groups_lg   7.04±0.04s    7.23±0.1s   6.63±0.01s\nmany_groups_sm  6.14±0.02s    6.65±0.01s  5.73±0.01s\nmany_groups_lg  3.50±0.02s    3.68±0.04s  3.37±0.02s\nwide_cols       7.51±0.1s     7.51±0.01s  6.82±0.03s\nmixed_cols      3.17±0.02s    3.36±0.05s  2.87±0.03s\nnested_struct   7.47±0.01s    8.62±0.07s  5.19±0.04s\n\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d  \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d  \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d  \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\n```\n\n`TransformWithStatePandasUDFPeakmemBench`:\n\n```text\n\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d  \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d  \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d  
\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\nscenario        identity_udf  sort_udf  count_udf\n\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d  \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d  \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d  \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\nfew_groups_sm   486M          488M      477M\nfew_groups_lg   569M          581M      541M\nmany_groups_sm  513M          512M      492M\nmany_groups_lg  517M          514M      492M\nwide_cols       620M          621M      585M\nmixed_cols      561M          561M      561M\nnested_struct   589M          589M      589M\n\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d  \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d  \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d  \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\n```\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo\n\nCloses #56192 from Yicong-Huang/SPARK-57020/bench/tws-pandas.\n\nAuthored-by: Yicong Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\nSigned-off-by: Yicong-Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\n"
    },
    {
      "commit": "be299a1efd4e30ddbee15e874bb6cfd2ff95f179",
      "tree": "a2abd952c59ae8d6b94383be6bbf283a0f74ebc3",
      "parents": [
        "89dff6beac6e092ea485b5f40a18bee916a89da6"
      ],
      "author": {
        "name": "Yicong Huang",
        "email": "17627829+Yicong-Huang@users.noreply.github.com",
        "time": "Thu Jun 11 18:05:20 2026 +0000"
      },
      "committer": {
        "name": "Yicong-Huang",
        "email": "17627829+Yicong-Huang@users.noreply.github.com",
        "time": "Thu Jun 11 18:05:20 2026 +0000"
      },
      "message": "[SPARK-57361][PYTHON] Refactor SQL_ARROW_UDTF\n\n### What changes were proposed in this pull request?\n\nThis PR refactors `SQL_ARROW_UDTF` so that the worker uses the plain `ArrowStreamSerializer` for pure Arrow stream I/O, moving the per-batch transformation logic from `ArrowStreamArrowUDTFSerializer` into `read_udtf()` in `worker.py`:\n\n- The input-side flattening (struct columns at `table_arg_offsets` are flattened into `pa.RecordBatch`, other columns are passed as `pa.Array`) moves from `ArrowStreamArrowUDTFSerializer.load_stream` into the `func` execution block (same style as the refactored eval types in `read_udfs()`).\n- The output-side type coercion (`ArrowBatchTransformer.enforce_schema` against the Arrow return type) and struct wrapping (`ArrowBatchTransformer.wrap_struct`) move from the serializer `dump_stream` chain into the `evaluate` wrapper, which now yields ready-to-write record batches instead of `(batch, arrow_return_type)` tuples.\n\n`ArrowStreamArrowUDTFSerializer` itself is left in place and will be removed in a follow-up once it has no remaining usages.\n\n### Why are the changes needed?\n\nPart of [SPARK-55388](https://issues.apache.org/jira/browse/SPARK-55388). Keeping serializers as pure Arrow stream I/O and concentrating eval-type-specific logic in `worker.py` makes the per-eval-type data flow explicit and removes serializer subclasses that exist only to carry per-eval-type transforms.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nExisting tests (`pyspark.sql.tests.arrow.test_arrow_udtf`, `pyspark.sql.tests.test_udtf`). No behavior change: replaying identical worker input through the old and new code paths produces byte-identical worker output (modulo the timing section) across 9 scenario/UDTF combinations.\n\nASV benchmark comparison (`bench_eval_type.ArrowUDTFTimeBench`, `-a repeat\u003d3`, 3 runs per side, averaged). 
before \u003d `upstream/master`, after \u003d this PR.\n\n```text\nscenario            udtf             before    after     diff\n------------------  --------------  -------  -------  -------\nsm_batch_few_col    identity_udtf    1.37ms   1.35ms    -1.5%\nsm_batch_few_col    filter_udtf      1.81ms   1.84ms    +1.7%\nsm_batch_few_col    count_udtf       1.07ms   1.05ms    -2.2%\nsm_batch_many_col   identity_udtf    2.42ms   2.33ms    -3.9%\nsm_batch_many_col   filter_udtf      3.50ms   3.43ms    -2.2%\nsm_batch_many_col   count_udtf       1.14ms   1.11ms    -2.6%\nlg_batch_few_col    identity_udtf    2.00ms   2.21ms   +10.7%\nlg_batch_few_col    filter_udtf      3.33ms   3.40ms    +2.2%\nlg_batch_few_col    count_udtf       1.20ms   1.20ms    +0.3%\nlg_batch_many_col   identity_udtf    9.72ms   9.85ms    +1.4%\nlg_batch_many_col   filter_udtf     15.57ms  15.70ms    +0.9%\nlg_batch_many_col   count_udtf       3.54ms   3.49ms    -1.5%\npure_ints           identity_udtf    3.20ms   3.19ms    -0.2%\npure_ints           filter_udtf      4.77ms   4.38ms    -8.1%\npure_ints           count_udtf       1.90ms   1.85ms    -2.6%\npure_strings        identity_udtf    5.16ms   4.79ms    -7.2%\npure_strings        filter_udtf      9.15ms   9.02ms    -1.4%\npure_strings        count_udtf       2.28ms   2.25ms    -1.2%\n```\n\nThe `lg_batch_few_col / identity_udtf` cell is a noise artifact: its per-run confidence intervals overlap (both sides spike to ~2.3ms intermittently on this machine), and re-running it in isolation with min-of-30 direct timing shows the refactored path is faster (min 2.34ms vs 2.59ms, median 2.36ms vs 2.76ms).\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #56440 from Yicong-Huang/refactor/arrow-udtf.\n\nAuthored-by: Yicong Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\nSigned-off-by: Yicong-Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\n"
    },
    {
      "commit": "89dff6beac6e092ea485b5f40a18bee916a89da6",
      "tree": "efd2806227fd6f186b0ae1d5c77280d1e19ab9b6",
      "parents": [
        "302ba67264fe56008105c096ec1f06eb6645b81c"
      ],
      "author": {
        "name": "akshatshenoi-db",
        "email": "akshat.shenoi@databricks.com",
        "time": "Thu Jun 11 10:53:42 2026 -0700"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Thu Jun 11 10:53:42 2026 -0700"
      },
      "message": "[SPARK-57321][SQL] Infer CSV schema from tar archives\n\n### What changes were proposed in this pull request?\n\n\u003e **Stacked on #56193** (CSV tar archive read support). Please review/merge that PR first; until it merges, this PR\u0027s diff will also include its commits. The inference-specific changes are in the top commit.\n\nAdds CSV **schema inference** for tar archives (`.tar`/`.tar.gz`/`.tgz`), building on the archive read support in #56193. When `spark.sql.files.archive.reader.enabled` is set and an input path is a tar archive, `CSVDataSource.inferSchema` infers it by streaming the archive\u0027s entries through the existing `ArchiveReader` (never unpacked to disk).\n\nInference runs as a **single `CSVInferSchema` pass over all inputs**: every archive entry and every loose CSV file is tokenized into one `tokenRDD`, keyed on the first input\u0027s header, sampled, and inferred in one pass (the column count is fixed by the first header, and `NullType` columns survive until the final `toStructFields`). Tokenization is **mode-specific**, matching how the scan reads each input: in the default mode each entry is tokenized line-by-line (so a quoted field containing a newline splits across rows, exactly as the line-based scan reads it), while under `multiLine` each entry is parsed as one continuous stream. Each entry\u0027s own first row is dropped as its header — rather than the line model\u0027s \"drop every line equal to the first header\" — matching what the archive scan returns. So an archive (or a mix of archives and loose files) infers the schema the scan actually produces for those same files. 
`ignoreCorruptFiles` / `ignoreMissingFiles` are honored at whole-input granularity: a corrupt or missing archive (or file) is skipped as a unit.\n\nArchive scanning is wired into the V1 file source only (#56193), so archive schema inference is gated on whether the calling scan path can read archives: `CSVFileFormat` (V1) passes `supportsArchiveScan \u003d true`, while the DSv2 `CSVTable` passes `false`. When the DSv2 path encounters an archive input, inference returns `None` so the read fails loudly with `UNABLE_TO_INFER_SCHEMA`, rather than letting the V2 scan parse raw archive bytes as CSV.\n\nThe enablement flag is read from `FileSourceOptions.archiveFormatEnabled` (added in #56193); no new config is introduced. The config doc is updated to note archives are supported during both scan and schema inference.\n\n### Why are the changes needed?\n\nThe archive feature was split into two PRs to keep each reviewable: #56193 adds reading, and this PR adds schema inference so that `inferSchema\u003dtrue` (and the default inference when no schema is supplied) works for archives the same way it does for a directory of CSV files. Without it, reading an archive without an explicit schema errors with `UNABLE_TO_INFER_SCHEMA`.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. The capability is behind the `spark.sql.files.archive.reader.enabled` config (default `false`, introduced in #56193); this PR only extends that opt-in feature to schema inference.\n\n### How was this patch tested?\n\nAdded inference tests to `CSVArchiveReadBase`, run in both header and headerless modes by `CSVHeaderTarArchiveReadSuite` / `CSVHeaderlessTarArchiveReadSuite`:\n- an archive infers the same schema as a directory of the same files;\n- all archive formats (`.tar`/`.tar.gz`/`.tgz`) infer the same schema;\n- a corrupt archive among good ones is skipped under `ignoreCorruptFiles`;\n- a column\u0027s type is widened across entries (e.g. 
an integer column in one entry and a string column in another infer as string);\n- archive entries and loose files in the same directory infer a single merged schema;\n- a column that is empty in an archived file but typed in a loose file is **not** collapsed to `StringType` — it widens with the loose value, matching a single-pass directory read;\n- when entries have different widths, the column count is fixed by the first entry\u0027s header, and the columns keep their inferred types (pinning the per-entry header drop);\n- inference uses the same record model as the line-based scan: a quoted field with an embedded newline is split across rows in the default mode, so the archive infers the same schema as that entry read as a loose file;\n- the same parity holds under `multiLine`, where the quoted newline stays a single record — pinning the whole-stream tokenization branch;\n- on the DSv2 path (`CSVTable`) an archive input keeps raising `UNABLE_TO_INFER_SCHEMA` (inference returns `None`) instead of inferring a schema the V2 scan would mis-read.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Opus 4.8)\n\nCloses #56254 from akshatshenoi-db/archive-format-schema-inference.\n\nAuthored-by: akshatshenoi-db \u003cakshat.shenoi@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "302ba67264fe56008105c096ec1f06eb6645b81c",
      "tree": "f92b4533097278dcc2e27dea603cf549e6c4a161",
      "parents": [
        "e33017a9c62008e007333dad8ca13ab049aaf9d8"
      ],
      "author": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Thu Jun 11 10:44:31 2026 -0700"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Thu Jun 11 10:44:31 2026 -0700"
      },
      "message": "[SPARK-56877][SQL][FOLLOWUP] Make PartitioningCollection invariant check O(1) per nesting level\n\n### What changes were proposed in this pull request?\n\nMake the `KeyedPartitioning` invariant enforcement added in #55901 (SPARK-56877) cheap for deeply nested `PartitioningCollection`s:\n\n- `checkKeyedPartitioningInvariant()` no longer walks the entire partitioning tree with `TreeNode.foreach` on every construction. Each collection now exposes a cached representative (`firstKeyedPartitioning`); since every nested collection already enforced the invariant on its own construction, comparing one representative per direct member validates the whole subtree by induction. The check is now O(partitionings.size) instead of O(subtree size).\n- `PartitioningCollection.fromPartitionings` no longer rebuilds nested collections that are already consistent. Using the same representative, a nested collection with no `KeyedPartitioning` in its subtree, or whose canonical `partitionKeys` reference already matches, is returned as-is in O(1). It only recurses and rebuilds when interning is actually needed.\n\nThe invariant itself is unchanged and still enforced in the constructor.\n\n### Why are the changes needed?\n\nSPARK-56877 caused a planning-time regression for plans that chain many shuffle joins on the same key. For an inner join, `ShuffledJoin.outputPartitioning` wraps the children\u0027s partitionings in a `PartitioningCollection`, so a chain of N same-key joins nests collections N levels deep, and `outputPartitioning` is a `def` recomputed on every access. With the constructor running a full-subtree walk and `fromPartitionings` rebuilding every nested level (re-triggering the walk at each level), a single `outputPartitioning` evaluation at depth N went from O(N) to O(N^3). 
On a benchmark query chaining 125 shuffle hash joins, `EnsureRequirements`-phase planning time grew ~9x (312 ms to 2779 ms), regressing end-to-end query time by ~33%.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nNew unit tests in `DistributionSuite`:\n- `fromPartitionings` returns already-consistent nested collections reference-equal (guards against reintroducing the rebuild),\n- `fromPartitionings` still interns structurally-equal-but-reference-distinct `partitionKeys` across nesting,\n- the constructor still rejects `partitionKeys` reference mismatches and expression arity mismatches through nesting.\n\nExisting suites covering the invariant and interning behavior pass: `DistributionSuite`, `EnsureRequirementsSuite`, `ProjectedOrderingAndPartitioningSuite`, `GroupPartitionsExecSuite`.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Fable 5)\n\nCloses #56411 from cloud-fan/fix-partitioning-collection-invariant-perf.\n\nAuthored-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "e33017a9c62008e007333dad8ca13ab049aaf9d8",
      "tree": "573d03bd836891ee4d26149c5d1dd5c6ef77516f",
      "parents": [
        "6693d43d11d7d6fb3fc3aa90bc0601c569e9c6ab"
      ],
      "author": {
        "name": "Liang-Chi Hsieh",
        "email": "viirya@gmail.com",
        "time": "Thu Jun 11 08:51:06 2026 -0700"
      },
      "committer": {
        "name": "Liang-Chi Hsieh",
        "email": "viirya@gmail.com",
        "time": "Thu Jun 11 08:51:06 2026 -0700"
      },
      "message": "[SPARK-57383][SQL][PYTHON] Honor configured Arrow zstd compression level when writing Arrow batches\n\n### What changes were proposed in this pull request?\n\nThis PR fixes a bug where the zstd compression level configured via `spark.sql.execution.arrow.compression.zstd.level` was silently ignored everywhere Arrow batches are compressed. Three places shared the same broken pattern:\n\n- `ArrowConverters.ArrowBatchIterator` (SPARK-54134)\n- `PythonArrowInput` (SPARK-54226; also covers `GroupedPythonArrowInput`, which reuses this codec via SPARK-55328)\n- `CoGroupedArrowPythonRunner` (SPARK-54226)\n\nThey constructed `new ZstdCompressionCodec(level)` only to read its codec type, then rebuilt the codec through `CompressionCodec.Factory.INSTANCE.createCodec(codecType)`. The codec type enum does not carry a level, so that single-argument factory overload always builds a codec at the zstd default level (3), dropping the configured one.\n\nThe codec construction is extracted into a shared `ArrowCompressionUtils.createCompressionCodec` helper that constructs the level-carrying codec instance directly (the helper lives in `sql/core` because `sql/api`, where `ArrowUtils` is, has no `arrow-compression` dependency). The level only matters on the write side; the read side looks up the codec by the type recorded in the IPC message, so reads are unaffected and the on-wire format is unchanged.\n\nThe same bug class was found by dbtsai during review of #56334 (https://github.com/apache/spark/pull/56334#discussion_r3391654988); that PR fixes the cache-side instance of the pattern, and this PR fixes the remaining three pre-existing instances.\n\n### Why are the changes needed?\n\nUsers tuning `spark.sql.execution.arrow.compression.zstd.level` for Python UDF exchange or `df.toArrow()` got no effect at all: every level compressed identically at the default level 3, with no error or warning.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes. 
The configured zstd level now actually takes effect; previously all levels behaved like the default level 3. The bug exists in released Spark 4.1.0/4.1.1/4.1.2 (SPARK-54134 and SPARK-54226 were backported to branch-4.1) as well as 4.2.0 RCs and master, so this fix is a candidate for backporting to branch-4.1 and branch-4.2. Note that a branch-4.1 backport needs to fix a fourth copy of the pattern: there `GroupedPythonArrowInput` still has its own codec construction, since the SPARK-55328 deduplication is master-only.\n\n### How was this patch tested?\n\nNew `ArrowCompressionUtilsSuite`. The regression test compresses the same compressible-but-varying batch at zstd level -5 and level 19 and asserts level 19 produces a strictly smaller payload. Against the old codec construction this test fails with byte-identical sizes at both levels (verified locally). A second test covers the `none` codec and the unsupported-codec error.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code\n\nCloses #56444 from viirya/fix-arrow-zstd-level.\n\nAuthored-by: Liang-Chi Hsieh \u003cviirya@gmail.com\u003e\nSigned-off-by: Liang-Chi Hsieh \u003cviirya@gmail.com\u003e\n"
    },
    {
      "commit": "6693d43d11d7d6fb3fc3aa90bc0601c569e9c6ab",
      "tree": "ab0442baf125dad997d39a8d91b65248b9c12698",
      "parents": [
        "90f6bab057800d99b9ea4faf09a0025427442e42"
      ],
      "author": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Thu Jun 11 07:06:01 2026 -0700"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Thu Jun 11 07:06:01 2026 -0700"
      },
      "message": "[SPARK-57377][INFRA] Add CI check to prevent new entries in the config binding policy exceptions file\n\n### What changes were proposed in this pull request?\n\nThis PR adds a new lightweight CI job `binding-policy` (\"Config binding policy exceptions check\") to `build_and_test.yml`. The job diffs `sql/hive/src/test/resources/conf/binding-policy-exceptions/configs-without-binding-policy-exceptions` against the latest apache/spark base (using the same fork-sync mechanism as the existing `buf` job) and fails if the PR adds any entries to the file, with an error message pointing the author to `.withBindingPolicy()` and the `ConfigBindingPolicy` scaladoc. Removing entries from the file is still allowed (and encouraged).\n\nThe job runs on a bare runner in well under a minute, is gated on the existing `lint` flag from the `precondition` job, only runs on forks (where PR builds run, so the diff against apache/spark master is exactly the PR\u0027s change; on apache/spark itself the job is skipped at the scheduler level and consumes no runner), and skips cleanly on branches where the exceptions file does not exist.\n\nOn top of the CI check, this PR also:\n\n- Documents how to choose a binding policy: the `ConfigBindingPolicy` scaladoc now has a 3-step decision flow (does the config change the result of resolving a view/UDF/procedure body? should persisted objects freeze the create-time value, like ANSI mode or session timezone? otherwise SESSION). 
`SparkConfigBindingPolicySuite` and the CI job\u0027s error message point to that scaladoc as the single source of truth, and the suite\u0027s failure message now explicitly says not to grow the exceptions file.\n- Fixes the 12 recently-added configs that were put in the exceptions file instead of declaring a policy, all with `NOT_APPLICABLE` (none of them can change the result of resolving a view/UDF/procedure body): `spark.sql.execution.useHashAggregateExec`, `spark.sql.hive.thriftServer.http.sniHostCheckEnabled`, `spark.sql.streaming.queryEvolution.enableSinkEvolution`, `spark.sql.insertNestedTypeCoercion.enabled`, `spark.testing.injectShuffleFetchFailures`, `spark.ui.jetty.sniHostCheckEnabled`, `spark.scheduler.streaming.idAwareLogging.enabled`, `spark.scheduler.streaming.idAwareLogging.queryIdLength`, `spark.driver.limitActiveProcessorCount.enabled`, `spark.executor.limitActiveProcessorCount.enabled`, `spark.yarn.am.limitActiveProcessorCount.enabled`, `spark.history.fs.update.scanDisabledPathPatterns`. The exceptions file shrinks by 12 entries.\n- Relabels the four `spark.sql.window.segmentTree.*` configs from `SESSION` to `NOT_APPLICABLE` per the review discussion (runtime-identical, but `NOT_APPLICABLE` is the accurate label for physical-execution toggles).\n- Fixes the `version()` of `spark.sql.streaming.queryEvolution.enableSinkEvolution` from 4.1.0 to 4.3.0 (it was added after the branch-4.2 cut).\n\n### Why are the changes needed?\n\nThe exceptions file is a frozen list of pre-existing configs that were created before the binding policy was introduced, and it should only ever shrink. 
`SparkConfigBindingPolicySuite` forces every config to either declare a binding policy or be listed in this file, so PR authors who add a new config without a binding policy see a test failure and \"fix\" it by adding the config to the exceptions file, which defeats the purpose of the enforcement (see https://github.com/apache/spark/pull/56323#discussion_r3389669681 for a recent example). A new config always has a valid policy choice, including `NOT_APPLICABLE` for configs that cannot change the result of resolving SQL views/UDFs/procedures, so additions to the exceptions file are never justified. Only a diff-based CI check can catch this, since to a regular test the grown file looks consistent.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. `SESSION` and `NOT_APPLICABLE` behave identically at runtime, and declaring either for a config that previously had no policy only affects view/UDF resolution for configs that are actually consulted there, which none of the touched configs are.\n\n### How was this patch tested?\n\nThe CI job was verified end to end on this PR itself: a temporary commit adding an entry to the exceptions file made the job fail with the guidance message in ~10 seconds (https://github.com/cloud-fan/spark/actions/runs/27302316185), and the commits that shrink the file pass the check. The config changes are covered by the existing `SparkConfigBindingPolicySuite`, which verifies that every config either declares a policy or is listed in the exceptions file, and that no config does both.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code\n\nThis pull request and its description were written by Isaac.\n\nCloses #56437 from cloud-fan/binding-policy-ci-check.\n\nAuthored-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "90f6bab057800d99b9ea4faf09a0025427442e42",
      "tree": "387fb8724272d9d74e9e6e90dc767234b1f1aa1b",
      "parents": [
        "60acc8f317fc1c805f744bb3cb43bd841ac20591"
      ],
      "author": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Thu Jun 11 20:48:24 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@foxmail.com",
        "time": "Thu Jun 11 20:48:24 2026 +0800"
      },
      "message": "[SPARK-57368][PYTHON][ML][TEST] Fix assertTrue misuse in PySpark tests\n\n### What changes were proposed in this pull request?\n\nReplace `self.assertTrue(a, b)` with `self.assertEqual(a, b)` in cases where the second argument is an expected value, not a failure message.\n\n`unittest.TestCase.assertTrue(expr, msg\u003dNone)` treats the second argument as an optional failure message, so `assertTrue(a, b)` only checks that `a` is truthy — it never compares `a` against `b`. These tests were silently passing regardless of the actual value.\n\nFixed in three files:\n- `sql/tests/test_functions.py`: `assertTrue(row[1], 1)` / `assertTrue(row[2], 1)` — crosstab result checks\n- `ml/tests/test_linalg.py`: repr/str checks for dense and sparse matrices, norm value checks, UDT schema and matrix equality checks\n- `mllib/tests/test_linalg.py`: same patterns (mirrors the `ml` file)\n\n### Why are the changes needed?\n\nThese are silent test bugs: the assertions always pass as long as the first argument is truthy, which it always is (non-empty strings, non-zero numbers, matrix objects). The expected values in the second argument were never verified. `assertEqual` makes the tests actually check what they intended.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nTest-only change. The assertions being fixed already passed (because they only checked truthiness), and after the fix they continue to pass because the actual values match the expected values.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (claude-sonnet-4-6)\n\nCloses #56428 from zhengruifeng/fix-assertTrue-dev2.\n\nAuthored-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@foxmail.com\u003e\n"
    },
    {
      "commit": "60acc8f317fc1c805f744bb3cb43bd841ac20591",
      "tree": "1a3fff1076677959470e64357bd4d8a3c5b6d823",
      "parents": [
        "9018d84117e4bf9e704e5b0c5246d5add73a74ed"
      ],
      "author": {
        "name": "Kousuke Saruta",
        "email": "sarutak@amazon.co.jp",
        "time": "Thu Jun 11 16:42:41 2026 +0900"
      },
      "committer": {
        "name": "Kousuke Saruta",
        "email": "sarutak@apache.org",
        "time": "Thu Jun 11 16:42:41 2026 +0900"
      },
      "message": "[SPARK-57332][SQL][FOLLOWUP] Fix line length exceeding 100 characters in JDBCSuite and V2ExpressionSQLBuilder\n\n### What changes were proposed in this pull request?\nThis PR fixes a style issue introduced by #56384.\nThere are some lines in `JDBCSuite.scala` and `V2ExpressionSQLBuilder.java`‎ whose length exceed 100 characters.\n\n### Why are the changes needed?\nTo recover CI.\n\n### Does this PR introduce _any_ user-facing change?\nNo.\n\n### How was this patch tested?\nGA.\n\n### Was this patch authored or co-authored using generative AI tooling?\nKiro CLI / Claude\n\nCloses #56441 from sarutak/decouple-like-pattern-escaping.\n\nLead-authored-by: Kousuke Saruta \u003csarutak@amazon.co.jp\u003e\nCo-authored-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\nSigned-off-by: Kousuke Saruta \u003csarutak@apache.org\u003e\n"
    },
    {
      "commit": "9018d84117e4bf9e704e5b0c5246d5add73a74ed",
      "tree": "e2c559b2c6faeddf3873aa5bbc72dff56c069706",
      "parents": [
        "f2d11a668b165a396f449efd6c64713989e66b75"
      ],
      "author": {
        "name": "YangJie",
        "email": "yangjie01@baidu.com",
        "time": "Thu Jun 11 11:08:26 2026 +0800"
      },
      "committer": {
        "name": "yangjie01",
        "email": "yangjie01@baidu.com",
        "time": "Thu Jun 11 11:08:26 2026 +0800"
      },
      "message": "[SPARK-57263][SQL][FOLLOWUP] Fix Hive 4.2 getTablesByName compatibility\n\n### What changes were proposed in this pull request?\nThis PR updates `HiveClientImpl.getTablesByName` to preserve Spark\u0027s existing behavior when Hive 4.2 throws `NoSuchObjectException` for missing table names in a batch lookup.\n\nOn that specific exception, Spark falls back to single-table lookups and returns only the tables that exist. Other metastore errors continue to be reported through the existing `UNABLE_TO_FETCH_HIVE_TABLES` path.\n\n### Why are the changes needed?\nThis is a follow-up for SPARK-57263. HIVE-27473 (apache/hive#5771) changed the effective Hive 4.2 client path for [getTableObjectsByName](https://github.com/apache/hive/blob/cb06ad72d609e51b6a3a38ccb120e34b4281067c/ql/src/java/org/apache/hadoop/hive/ql/metadata/SessionHiveMetaStoreClient.java#L315-L327): in the default catalog path it may call `getTable` for each requested table name, so a missing table can fail the whole batch.\n\nThis caused failures in the Java 21 daily tests:\n- https://github.com/apache/spark/actions/runs/27083890098/attempts/1#summary-79935380891\n\n```\n4.2: getTablesByName when some tables do not exist: org.apache.spark.sql.hive.client.HiveClientSuite\norg.apache.spark.SparkException: [UNABLE_TO_FETCH_HIVE_TABLES] Unable to fetch tables of Hive database: default. SQLSTATE: 58030\nsbt.ForkMain$ForkError: org.apache.spark.SparkException: [UNABLE_TO_FETCH_HIVE_TABLES] Unable to fetch tables of Hive database: default. 
SQLSTATE: 58030\nat org.apache.spark.sql.errors.QueryExecutionErrors$.cannotFetchTablesOfDatabaseError(QueryExecutionErrors.scala:1761)\nat org.apache.spark.sql.hive.client.HiveClientImpl.getRawTablesByName(HiveClientImpl.scala:430)\nat org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTablesByName$1(HiveClientImpl.scala:441)\nat org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:297)\nat org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:240)\nat org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:239)\nat org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:277)\nat org.apache.spark.sql.hive.client.HiveClientImpl.getTablesByName(HiveClientImpl.scala:441)\nat org.apache.spark.sql.hive.client.HiveClientSuite.$anonfun$new$25(HiveClientSuite.scala:262)\nat org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)\nat org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:282)\nat org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231)\nat org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230)\nat org.apache.spark.SparkFunSuite.failAfter(SparkFunSuite.scala:33)\nat org.apache.spark.SparkFunSuite.$anonfun$test$2(SparkFunSuite.scala:44)\nat org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)\nat org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)\nat org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)\nat org.scalatest.Transformer.apply(Transformer.scala:22)\nat org.scalatest.Transformer.apply(Transformer.scala:20)\n```\n\nSpark\u0027s `getTablesByName` contract is to return existing tables and ignore missing names. 
The fallback keeps that contract while limiting the compatibility handling to the missing-table case.\n\n### Does this PR introduce _any_ user-facing change?\nNo.\n\n### How was this patch tested?\n- Pass Github Actions\n- Run Hive 4.2 related regression tests with Java 21: https://github.com/LuciferYang/spark/actions/runs/27182854479/job/80247153949\n\n### Was this patch authored or co-authored using generative AI tooling?\nGenerated-by: Codex (GPT-5)\n\nCloses #56390 from LuciferYang/fix-hive-get-tables-by-name-java21.\n\nAuthored-by: YangJie \u003cyangjie01@baidu.com\u003e\nSigned-off-by: yangjie01 \u003cyangjie01@baidu.com\u003e\n"
    },
    {
      "commit": "f2d11a668b165a396f449efd6c64713989e66b75",
      "tree": "ba0f6ec2e74cccc45c56758e4c2308d7f94f24a4",
      "parents": [
        "79dcac935b2f18c6605cc14a76eccf9efee161d1"
      ],
      "author": {
        "name": "Eric Yang",
        "email": "jiwen624@gmail.com",
        "time": "Wed Jun 10 17:44:00 2026 -0700"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Wed Jun 10 17:44:00 2026 -0700"
      },
      "message": "[SPARK-57313][SQL] Fix SampleExec numOutputRows metric when whole-stage codegen is disabled\n\n### What changes were proposed in this pull request?\nUpdate the metric `number of output rows` for non-WSCG code path\n\n### Why are the changes needed?\n`number of output rows` metric is not updated/incremented when WSCG is off - it can be useful to show the numOutputRows and be consistent with the WSCG path.\n\u003cimg width\u003d\"370\" height\u003d\"339\" alt\u003d\"image\" src\u003d\"https://github.com/user-attachments/assets/3531d904-16c6-456c-a9a0-6b667b2449f8\" /\u003e\n\n### Does this PR introduce _any_ user-facing change?\nYes. `number of output rows` metric gets correct value now when WSCG is off\n\n### How was this patch tested?\nNew test case added.\n\n### Was this patch authored or co-authored using generative AI tooling?\nNo\n\nCloses #56363 from jiwen624/SPARK-57313.\n\nAuthored-by: Eric Yang \u003cjiwen624@gmail.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "79dcac935b2f18c6605cc14a76eccf9efee161d1",
      "tree": "2cd8f184a57dd701daeaa1c60f3664407a80507e",
      "parents": [
        "8e22e990994a3e4bf35ddaebe0e06fa5b36e284f"
      ],
      "author": {
        "name": "Ayush",
        "email": "bilalaayush@hotmail.com",
        "time": "Wed Jun 10 17:33:29 2026 -0700"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Wed Jun 10 17:33:29 2026 -0700"
      },
      "message": "[SPARK-53840][SQL] Add AS JSON output support for SHOW TABLES and SHOW TABLE EXTENDED\n\n### What changes were proposed in this pull request?\nSupport SHOW TABLES ... [AS JSON] and SHOW TABLE EXTENDED ... [AS JSON] to optionally display table listing metadata in JSON format.\nSQL syntax:\n```\nSHOW TABLES [(IN|FROM) database_name] [[LIKE] pattern] [AS JSON]\nSHOW TABLE EXTENDED [(IN|FROM) database_name] LIKE pattern [AS JSON]\n```\nOutput: json_metadata: String\n\nSHOW TABLES AS JSON:\n`{\"tables\":[{\"name\":\"t1\",\"namespace\":[\"db\"],\"isTemporary\":false}]}`\n\nSHOW TABLE EXTENDED ... AS JSON additionally includes catalog and type:\n`{\"tables\":[{\"catalog\":\"spark_catalog\",\"namespace\":[\"db\"],\"name\":\"t1\",\"type\":\"TABLE\",\"isTemporary\":false}]}`\n\n### Why are the changes needed?\nThe existing text-based output of SHOW TABLES requires fragile string parsing for programmatic consumption.\nA structured JSON format provides a stable, machine-readable contract for tooling and automation.\n\n### Does this PR introduce _any_ user-facing change?\nYes. Two new SQL syntax variants that return JSON output. Existing commands without AS JSON are unaffected.\nSHOW TABLE EXTENDED with both PARTITION and AS JSON is explicitly rejected.\n\n### How was this patch tested?\n- Parser tests in `ShowTablesParserSuite` for all AS JSON variants and the PARTITION + AS JSON error case.\n- Execution tests in `ShowTablesSuiteBase` covering JSON schema validation, empty databases, EXTENDED output, and temp view inclusion.\n- Manual verification in spark-shell\n\n### Was this patch authored or co-authored using generative AI tooling?\nNo\n\n---\nContinuation of #[54824](https://github.com/apache/spark/pull/54824) (closed to move off fork `master` branch for CI)\n\nCloses #56414 from ayushbilala/show-tables-json.\n\nAuthored-by: Ayush \u003cbilalaayush@hotmail.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "8e22e990994a3e4bf35ddaebe0e06fa5b36e284f",
      "tree": "f91d5743f104750508c1c3feb2e5be065530c161",
      "parents": [
        "f5eabcbd170bdbfb2049b930957fa13b37260e49"
      ],
      "author": {
        "name": "Szehon Ho",
        "email": "szehon.apache@gmail.com",
        "time": "Wed Jun 10 17:19:47 2026 -0700"
      },
      "committer": {
        "name": "Szehon Ho",
        "email": "szehon.apache@gmail.com",
        "time": "Wed Jun 10 17:19:47 2026 -0700"
      },
      "message": "[SPARK-57359][DOC] Document the MERGE INTO statement in the SQL reference\n\n### What changes were proposed in this pull request?\n\nThe SQL reference does not have a page for the `MERGE INTO` statement. This PR adds one at `docs/sql-ref-syntax-dml-merge-into.md` and links it from the DML statements list in `docs/sql-ref-syntax.md`.\n\nThe new page covers:\n- Description, including that `MERGE INTO` requires a Data Source V2 connector that supports row-level operations (linked to the DSV2 reference).\n- Syntax for the `ON` condition and the `WHEN MATCHED`, `WHEN NOT MATCHED [ BY TARGET ]`, and `WHEN NOT MATCHED BY SOURCE` clauses, with their allowed actions (`DELETE`, `UPDATE SET *`, `UPDATE SET col \u003d val`, `INSERT *`, `INSERT (cols) VALUES (...)`).\n- Parameters describing each clause and action.\n- Notes on multiple clauses being evaluated in order, the requirement of at least one `WHEN` clause, and the `MERGE_CARDINALITY_VIOLATION` behavior (including when the cardinality check is skipped).\n- Examples for update/insert, conditional delete/update/insert, delete-not-in-source, and update-not-in-source.\n\nThe `WITH SCHEMA EVOLUTION` clause is intentionally left for a follow-up PR.\n\n### Why are the changes needed?\n\n`MERGE INTO` is a supported SQL statement but has no entry in the SQL reference, so users have no documentation for its syntax and semantics.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. Documentation-only update.\n\n### How was this patch tested?\n\nDocs-only change. Reviewed the rendered Markdown for correctness.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Cursor (Claude Opus 4.8)\n\nCloses #56420 from szehon-ho/merge-schema-evolution-docs.\n\nAuthored-by: Szehon Ho \u003cszehon.apache@gmail.com\u003e\nSigned-off-by: Szehon Ho \u003cszehon.apache@gmail.com\u003e\n"
    },
    {
      "commit": "f5eabcbd170bdbfb2049b930957fa13b37260e49",
      "tree": "9fc52a817e2caa8fe875c300f052861a26d704f5",
      "parents": [
        "da67157dbd89906b37864d23a62bfea8a42fccda"
      ],
      "author": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Wed Jun 10 16:03:52 2026 -0700"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Wed Jun 10 16:03:52 2026 -0700"
      },
      "message": "[SPARK-57332][SQL] Fix MySQL backslash escaping in LIKE predicate pushdown via a dialect string-literal escaping hook\n\n### What changes were proposed in this pull request?\n\nBuilding a pushed-down `LIKE` predicate for `STARTS_WITH` / `ENDS_WITH` / `CONTAINS` involves two distinct escaping layers:\n\n1. **LIKE meta-char escaping** -- escape `_`, `%`, and the escape char `\\` itself, using `\\` as the LIKE escape character. This is what `V2ExpressionSQLBuilder.escapeSpecialCharsForLikePattern` does, and it is dialect-independent.\n2. **SQL string-literal escaping** -- make the resulting pattern text survive the target database\u0027s string-literal parser. This is dialect-specific: standard SQL only doubles `\u0027`, but MySQL also processes `\\` inside string literals.\n\nThe base builder only handled layer 1 and assumed a trivial layer 2 (hand-wrapping the value in `\u0027...\u0027`). MySQL is the one dialect whose layer 2 differs, so it had to override all three `visitStartsWith`/`visitEndsWith`/`visitContains` methods purely to emit `ESCAPE \u0027\\\\\u0027`.\n\nThis PR separates the two layers:\n\n- Adds a protected, overridable hook `escapeStringLiteralForLikePattern(String)` to `V2ExpressionSQLBuilder`. 
It defaults to the identity function, so the generated SQL for every standard-SQL dialect is **byte-for-byte unchanged**.\n- The shared `visitStartsWith`/`visitEndsWith`/`visitContains` now route both the LIKE pattern and the `\\` escape character through this hook (via a small `likeWithEscape` helper), so the LIKE escape character is defined in one place.\n- `MySQLDialect` now overrides **only** `escapeStringLiteralForLikePattern` (doubling backslashes for MySQL\u0027s string-literal layer) and **deletes** its three duplicated visit-method overrides.\n\nThis is a follow-up to the design issue on #56350 (SPARK-57287).\n\n### Why are the changes needed?\n\nSPARK-57287 fixed backslash escaping in `escapeSpecialCharsForLikePattern` (layer 1), which is correct for standard-SQL dialects. But MySQL reuses that base method and adds an extra string-literal unescaping layer: it treats `\\` as an escape character inside string literals (this is exactly why `MySQLDialect` already wrote `ESCAPE \u0027\\\\\u0027` rather than `ESCAPE \u0027\\\u0027`, per SPARK-48172). MySQL\u0027s string-literal parser collapses the single backslash doubling back to one `\\`, so the pushed-down pattern for a value such as `startsWith(\"abc\\\")` resolved to `abc\\%` -\u003e `abc` followed by a literal `%`, returning silently wrong results.\n\nConcretely, before this PR, against MySQL:\n\n```\nspark.table(\"mysql_catalog.db.t\").filter($\"c\".startsWith(\"abc\\\\\"))\n// pushed: c LIKE \u0027abc\\\\%\u0027 ESCAPE \u0027\\\\\u0027  -\u003e after MySQL string-literal parsing: c LIKE \u0027abc\\%\u0027 ESCAPE \u0027\\\u0027\n// matches \"abc\" + literal \"%\", NOT values starting with \"abc\\\"\n```\n\nThe current coarse override surface (reimplement the whole visit method) is also what let SPARK-48172 fix the `ESCAPE` clause but miss the pattern body. 
Decoupling the two layers fixes the MySQL backslash case and removes the duplicated visit methods, so a dialect only needs to override the visit methods when its `LIKE` matching semantics genuinely differ -- never just for escaping.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes. For the MySQL JDBC dialect, `startsWith` / `endsWith` / `contains` predicates on values containing a backslash now push down a correct `LIKE` pattern and return correct results instead of silently wrong results. There is no change for any other dialect (the new hook is identity by default, and the generated SQL is unchanged).\n\n### How was this patch tested?\n\nAdded a unit test in `JDBCSuite` that compiles `STARTS_WITH` / `ENDS_WITH` / `CONTAINS` predicates with both the default dialect and `MySQLDialect`, asserting the generated SQL for backslash values (and that wildcard escaping is preserved). The default-dialect output is unchanged; the MySQL output now doubles backslashes through the string-literal layer. The existing H2-based coverage in `JDBCV2Suite` (added by SPARK-57287) continues to exercise the standard-SQL path end-to-end.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Opus 4.8)\n\nCloses #56384 from cloud-fan/decouple-like-pattern-escaping.\n\nAuthored-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "da67157dbd89906b37864d23a62bfea8a42fccda",
      "tree": "27c74eb217019a8992f56a9c4663bed43e2aaf41",
      "parents": [
        "1175d40c3b306d3626c0dfa8d25a8aef035113a7"
      ],
      "author": {
        "name": "Kousuke Saruta",
        "email": "sarutak@amazon.co.jp",
        "time": "Thu Jun 11 07:03:20 2026 +0900"
      },
      "committer": {
        "name": "Kousuke Saruta",
        "email": "sarutak@apache.org",
        "time": "Thu Jun 11 07:03:20 2026 +0900"
      },
      "message": "[SPARK-56887][SQL] Add dedicated sort-merge physical operator for AS-OF join\n\n### What changes were proposed in this pull request?\nAdd `SortMergeAsOfJoinExec`, a dedicated physical operator for AS-OF joins that replaces the existing correlated subquery rewrite (`RewriteAsOfJoin`) when `spark.sql.join.sortMergeAsOfJoin.enabled` is set to `true` (default `false`).\n\nThe operator co-partitions both sides by equi-join keys, sorts by (equi-keys, as-of key), and performs a single-pass merge scan per partition to find the nearest match for each left row. It exploits sort order for early termination by scanning in the optimal direction based on the join direction (right-to-left for backward, left-to-right for forward/nearest).\n\nChanges:\n- New physical operator: `SortMergeAsOfJoinExec`\n- New planner strategy: `AsOfJoinSelection` in `SparkStrategies`\n- Conditional skip of `RewriteAsOfJoin` when the conf is enabled\n- New SQLConf: `spark.sql.join.sortMergeAsOfJoin.enabled`\n\n### Why are the changes needed?\nThe current AS-OF join implementation rewrites `AsOfJoin` to a correlated scalar subquery with `MIN_BY`. 
This approach is O(N×M) per partition and causes OOM on moderate data sizes (100K+ rows), because the inequality condition (`left.t \u003e\u003d right.t`) cannot be decorrelated into an equi-join.\n\nThe sort-merge operator is O(N+M) per partition after sorting, with early termination within each equi-key group.\n\nBenchmark results on GitHub Actions (AMD EPYC 7763, 10K×10K rows, 100 equi-key groups):\n\n| | JDK 17 | JDK 21 | JDK 25 |\n|---|---|---|---|\n| With equi-key | 631.8× | 601.2× | 676.5× |\n| Without equi-key | 14.0× | 13.3× | 13.7× |\n\nFor 100K×100K rows, the baseline OOMs while the sort-merge operator completes in ~500 ms.\n\n### Why a dedicated operator instead of a Window-over-union rewrite?\nA Window rewrite (UNION ALL both sides with a source marker, then last(right_struct) IGNORE NULLS over ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) is viable for the common case (Backward, allowExactMatches\u003dtrue, no tolerance, no residual) and benefits from WindowExec\u0027s codegen/spill/AQE support. However, it does not generalize cleanly:\n\n- Tie-break ordering when left.t \u003d\u003d right.t requires a synthetic column whose direction depends on allowExactMatches.\n- NULL equi-keys under PARTITION BY semantics differ from EqualTo (same partition vs. no match).\n- Residual pair-correlated predicates (condition with non-equi terms) cannot be expressed in a Window frame.\n- Non-literal tolerance cannot be expressed as RANGE BETWEEN \u003cexpr\u003e PRECEDING.\n- Nearest requires two windows + a pick-closer Project.\n\nThe dedicated operator handles all six direction × tolerance × allowExactMatches combinations uniformly, supports\n  residual conditions, and avoids the UNION ALL materialization overhead. 
A Window rewrite would still require sorting and materializing the union of both sides, so the constant factor advantage of the dedicated operator (which scans both sides in a single pass without union materialization) is expected to be meaningful, while covering strictly more cases.\n\n### Does this PR introduce _any_ user-facing change?\nNo. The feature is opt-in via a new SQLConf that defaults to `false`. When disabled, the existing `RewriteAsOfJoin` path is used unchanged.\n\n### How was this patch tested?\n- `SortMergeAsOfJoinSuite`: 18 tests covering backward/forward/nearest directions, equi-keys, left outer, tolerance, allowExactMatches\u003dfalse, empty partitions, null keys, multiple data types (Int/Long/Double), self join, no equi-key, and conf-disabled fallback\n- `AsOfJoinBenchmark`: comparative benchmark (correlated subquery vs sort-merge)\n- Existing `DataFrameAsOfJoinSuite`: all 11 tests pass with default conf (no regression)\n\n### Was this patch authored or co-authored using generative AI tooling?\nGenerated-by: Claude (via Kiro CLI, auto model selection)\n\nCloses #55912 from sarutak/sort-merge-asof-join.\n\nLead-authored-by: Kousuke Saruta \u003csarutak@amazon.co.jp\u003e\nCo-authored-by: sarutak \u003csarutak@users.noreply.github.com\u003e\nSigned-off-by: Kousuke Saruta \u003csarutak@apache.org\u003e\n"
    },
    {
      "commit": "1175d40c3b306d3626c0dfa8d25a8aef035113a7",
      "tree": "fd00d49d01a2b1162d8ae9a39227465114287c27",
      "parents": [
        "d2cbc7f33efc051f0f23ce39ee8e8d1313588b2a"
      ],
      "author": {
        "name": "Maxim Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Thu Jun 11 00:02:40 2026 +0200"
      },
      "committer": {
        "name": "Max Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Thu Jun 11 00:02:40 2026 +0200"
      },
      "message": "[SPARK-57285][SQL] Route nanosecond timestamp cast-to-string through the Types Framework\n\n### What changes were proposed in this pull request?\n\nThis PR makes the Types Framework (`TypeApiOps`) the single integration point for nanosecond timestamp `CAST(... AS STRING)`, for both the interpreted and codegen paths.\n\nSpecifically:\n\n- `TypeApiOps.apply` gains a by-name `zoneId` parameter that defaults to the session-local time zone config (`SqlApiConf.get.sessionLocalTimeZone`) and is threaded into the `TIMESTAMP_LTZ` nanos ops. It is by-name so the zone is forced only when the LTZ ops is actually constructed: zone-independent (`TimeType`, `TIMESTAMP_NTZ` nanos) and unsupported types never evaluate it, which matters because a `CAST`\u0027s zone is unresolved (`None.get`) until a time zone is assigned.\n- `TimestampLTZNanosTypeApiOps` now carries its `ZoneId` as a required constructor parameter and holds the fraction formatter in a `transient private lazy val`, so the formatter is built once per ops instance (once per cast, per task) rather than per row. `TimestampNTZNanosTypeApiOps` stays zone-independent (UTC).\n- `ToStringBase` no longer bypasses or special-cases the framework: both the interpreted and codegen paths dispatch uniformly through `TypeApiOps(from, zoneId)`. `CAST` passes its resolved zone; zone-less callers accept the session-zone default.\n- The nanosecond ops implement `formatExternal` to render the external Row value (`Instant` for LTZ, `LocalDateTime` for NTZ) at the column precision, so `Row.json` / `Row.prettyJson` render nanos columns through the `formatExternal` routing introduced in #56392 (SPARK-57338). 
The two-arg `formatExternal(value, nested)` returns `None`, leaving the `HiveResult` output path unchanged.\n- The now-unused `UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING` error condition and its `DataTypeErrors` helper are removed.\n\nThe microsecond timestamp types (`TIMESTAMP` / `TIMESTAMP_NTZ`) remain handled inline in `ToStringBase` and are out of scope.\n\n### Why are the changes needed?\n\nSPARK-57256 implemented nanosecond cast-to-string inline in `ToStringBase`, deliberately bypassing the framework because the zone-less `TypeApiOps.format(v)` cannot render LTZ in the session time zone. That left nanos cast-to-string as a one-off integration outside the framework, inconsistent with the SPIP direction (SPARK-56822) of wiring the new types through the centralized `TypeOps` / `TypeApiOps`. This PR closes that gap.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes, two narrow deltas:\n\n1. `CAST(... AS STRING)` on nanosecond `TIMESTAMP_LTZ` / `TIMESTAMP_NTZ` now dispatches through the Types Framework. The rendered string is unchanged (zone-aware LTZ, zone-independent NTZ, precision flooring, trailing-zero trimming); the only behavioral change is that the zone-less type-level `format()` no longer throws `UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING` (that error condition is removed).\n2. 
`Row.json` / `Row.prettyJson` on a nanosecond `TIMESTAMP_LTZ` / `TIMESTAMP_NTZ` column now renders the value (NTZ zone-independent, LTZ in `spark.sql.session.timeZone`) via the `formatExternal` routing added in #56392 (SPARK-57338), instead of raising `UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING`.\n\n`EXPLAIN` and SQL-literal `toSQLValue` are unchanged: they don\u0027t route through the framework (`Literal.toString` / `Literal.sql` render via `value.toString`, and `toSQLValue` has no production caller).\n\n### How was this patch tested?\n\n- Updated `TimestampNanosTypeOpsSuite` to cover the new behavior: NTZ renders zone-independently, LTZ renders in the zone carried by the ops instance, and zone-less LTZ now renders in the session-local time zone (instead of raising), exercising precision flooring.\n- Added a `RowJsonSuite` test that `Row.json` renders a nanosecond `TIMESTAMP_NTZ` / `TIMESTAMP_LTZ` column at the column precision (this PR is rebased on the merged #56392, which routes `Row.jsonValue` through `formatExternal`).\n- `CastWithAnsiOnSuite`, `CastWithAnsiOffSuite`, `ToPrettyStringSuite`, and `TimestampNanosRowSuite` stay green (281 tests pass), and `SparkThrowableSuite` (33 tests) confirms the removed error condition leaves the error framework consistent.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Cursor (Claude Opus 4.8)\n\nCloses #56355 from MaxGekk/nanos-typeframework-formatter.\n\nAuthored-by: Maxim Gekk \u003cmax.gekk@gmail.com\u003e\nSigned-off-by: Max Gekk \u003cmax.gekk@gmail.com\u003e\n"
    },
    {
      "commit": "d2cbc7f33efc051f0f23ce39ee8e8d1313588b2a",
      "tree": "a6ae84f09084e7b8b363a637d2f2dac366521276",
      "parents": [
        "44984a6bfeca86e4f78d131609faee63072b05b1"
      ],
      "author": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Wed Jun 10 14:53:50 2026 -0700"
      },
      "committer": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Wed Jun 10 14:53:50 2026 -0700"
      },
      "message": "[SPARK-57374][BUILD] Upgrade `netty-tcnative` to 2.0.78.Final\n\n### What changes were proposed in this pull request?\n\nThis PR aims to upgrade `netty-tcnative` to 2.0.78.Final.\n\n### Why are the changes needed?\n\nTo bring the latest bug fixes. `netty-tcnative` 2.0.78.Final restores the `linux-aarch64` artifact publication to Maven Central which had regressed since 2.0.72 due to the publishing plugin migration.\n- https://github.com/netty/netty-tcnative/milestone/115\n  - netty/netty-tcnative#978\n  - netty/netty-tcnative#981\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nPass the CIs.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (claude-fable-5)\n\nCloses #56433 from dongjoon-hyun/SPARK-57374.\n\nAuthored-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\nSigned-off-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\n"
    },
    {
      "commit": "44984a6bfeca86e4f78d131609faee63072b05b1",
      "tree": "57d5cdd60c9b47d78386484245bf8aa1b6433436",
      "parents": [
        "9eb44b3ee9fe57c76cd6310270193aa0922d0944"
      ],
      "author": {
        "name": "Kousuke Saruta",
        "email": "sarutak@amazon.co.jp",
        "time": "Thu Jun 11 06:13:11 2026 +0900"
      },
      "committer": {
        "name": "Kousuke Saruta",
        "email": "sarutak@apache.org",
        "time": "Thu Jun 11 06:13:11 2026 +0900"
      },
      "message": "[SPARK-56995][SQL][DML][TESTS][FOLLOWUP] Fix AutoCdcScd1FullRefreshSuite by returning live table from SharedTablesInMemoryRowLevelOperationTableCatalog\n\n### What changes were proposed in this pull request?\nOverride loadTable in SharedTablesInMemoryRowLevelOperationTableCatalog to return the live table instance instead of a snapshot copy.\nSPARK-56995 (#56121) introduced a loadTable override in InMemoryRowLevelOperationTableCatalog that returns a deep copy of the table on every call. This is needed for DSv2 Transaction API cache validation semantics, but it breaks TRUNCATE TABLE and DROP TABLE IF EXISTS for tests using the shared catalog variant — the mutation is applied to a disposable copy, leaving the live table\u0027s data intact.\n\n### Why are the changes needed?\nTwo tests in `AutoCdcScd1FullRefreshSuite` have been failing since SPARK-56995 was merged:\n\n- \"full refresh wipes target rows and the auxiliary table for the refreshed flow\"\n- \"selective full refresh wipes only the requested target\u0027s auxiliary state\"\n\nhttps://github.com/apache/spark/actions/runs/27166004427/job/80196384953\n\nThe failures were not caught in SPARK-56995\u0027s CI because the pipelines module does not declare a dependency on sql or\ncatalyst, so its tests were not triggered by changes under sql/catalyst/.\n\n### Does this PR introduce _any_ user-facing change?\nNo.\n\n### How was this patch tested?\nConfirmed all 3 tests in `AutoCdcScd1FullRefreshSuite` and all 44 tests in `MergeIntoDataFrameSuite` passed.\n\n### Was this patch authored or co-authored using generative AI tooling?\nKiro CLI / Claude\n\nCloses #56378 from sarutak/fix-autocdc-full-refresh-suite.\n\nAuthored-by: Kousuke Saruta \u003csarutak@amazon.co.jp\u003e\nSigned-off-by: Kousuke Saruta \u003csarutak@apache.org\u003e\n"
    },
    {
      "commit": "9eb44b3ee9fe57c76cd6310270193aa0922d0944",
      "tree": "615eec02426ae493682a3e02c190c05b4359ab2e",
      "parents": [
        "0e8a75f044e27ab2d6960a41a9841d99216207ac"
      ],
      "author": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Wed Jun 10 13:48:18 2026 -0700"
      },
      "committer": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Wed Jun 10 13:48:18 2026 -0700"
      },
      "message": "[SPARK-57355][PYTHON] Fix __module__ check in udf profiler\n\n### What changes were proposed in this pull request?\n\nFor `getattr(chained_func, \"__module__\", \"\").startswith(\"pyspark.sql.worker.\")`, if `changed_func.__module__` is `None`, it will raise an exception. So we check the attribute.\n\n### Why are the changes needed?\n\nIn some rare cases, `chained_func.__module__` can be `None` - for example, if the function is created with `compile` or `exec`. We need to deal with that.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes. This fixed a bug when users use profilers against UDFs without `__module__`.\n\n### How was this patch tested?\n\nA regression test is added.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #56416 from gaogaotiantian/fix-datasource-profiler.\n\nAuthored-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\nSigned-off-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\n"
    },
    {
      "commit": "0e8a75f044e27ab2d6960a41a9841d99216207ac",
      "tree": "386af52a04b9a78cfda991b8d0719787f5cb47f8",
      "parents": [
        "a852aa3cd81dc23080f714ace207aeb5f7ca6dc8"
      ],
      "author": {
        "name": "Stevo Mitric",
        "email": "stevomitric2000@gmail.com",
        "time": "Wed Jun 10 21:03:54 2026 +0200"
      },
      "committer": {
        "name": "Max Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Wed Jun 10 21:03:54 2026 +0200"
      },
      "message": "[SPARK-57207][SQL][FOLLOWUP] Fix StackOverflowError when setting timestampNanosTypes.enabled via SparkConf\n\n### What changes were proposed in this pull request?\nThis PR deletes that validator and moves the requirement into the getter: `SQLConf.timestampNanosTypesEnabled` now returns true only when `spark.sql.types.framework.enabled` is also true.\n\n### Why are the changes needed?\nEnabling the flag the normal way crashes Spark:\n\n```scala\nval spark \u003d SparkSession.builder()\n  .master(\"local[1]\")\n  .config(\"spark.sql.timestampNanosTypes.enabled\", \"true\")\n  .getOrCreate()       // fine — options are applied lazily\nspark.sql(\"SELECT 1\")  // java.lang.StackOverflowError\n```\n\nThe validator runs while the session conf is being built (during `mergeSparkConf`), and `SQLConf.get` asks for that same conf which starts building it again, forever.\n\n### Does this PR introduce _any_ user-facing change?\nNo\n\n### How was this patch tested?\nTests in this PR.\n\n### Was this patch authored or co-authored using generative AI tooling?\nGenerated-by: Claude Code (Fable 5)\n\nCloses #56431 from stevomitric/stevomitric/fix-nanos-conf-checkvalue-recursion.\n\nAuthored-by: Stevo Mitric \u003cstevomitric2000@gmail.com\u003e\nSigned-off-by: Max Gekk \u003cmax.gekk@gmail.com\u003e\n"
    },
    {
      "commit": "a852aa3cd81dc23080f714ace207aeb5f7ca6dc8",
      "tree": "1ffb7538896cbac4d85e474924b195738978e2c3",
      "parents": [
        "bd046363da14d2235e20817c40dc949e99298c75"
      ],
      "author": {
        "name": "DB Tsai",
        "email": "dbtsai@dbtsai.com",
        "time": "Wed Jun 10 11:58:17 2026 -0700"
      },
      "committer": {
        "name": "DB Tsai",
        "email": "dbtsai@dbtsai.com",
        "time": "Wed Jun 10 11:58:17 2026 -0700"
      },
      "message": "[SPARK-57325][CONNECT] Stop streaming queries registered while the Connect session is closing\n\n### What changes were proposed in this pull request?\n\nThis PR fixes a race condition between registering a newly started streaming query and the owning Connect session being closed. The race strands two kinds of per-session resources, so the fix touches two server-side caches with the same publish-then-recheck pattern.\n\n#### 1. `SparkConnectStreamingQueryCache` (the query cache)\n\nWhen a Connect session is closed (`SessionHolder.close()`), it stops all of the session\u0027s streaming queries via `SparkConnectStreamingQueryCache.cleanupRunningQueries()`, which iterates over the query cache. As the existing code comment in `close()` notes, *\"there can be concurrent streaming queries being started.\"* A query that finishes `DataStreamWriter.start()` on another thread and is registered **after** that iteration is missed by the cleanup. It is left in the cache as an active entry holding a strong reference to the now-closed session, and is never stopped — so the driver cannot exit.\n\nThe fix makes `registerNewStreamingQuery` coordinate with session shutdown without introducing additional locking:\n\n1. **`SessionHolder.isClosing`** — a new `private[connect]` accessor exposing `closedTimeMs.isDefined`. `closedTimeMs` is set at the very start of `close()`, before any session resources (including streaming queries) are cleaned up, so it is a reliable \"session is shutting down\" signal.\n\n2. **Publish-then-recheck in `registerNewStreamingQuery`** — after the query is inserted into the cache, we re-check `sessionHolder.isClosing`. If the session is closing, we stop the query (asynchronously, so we don\u0027t block the caller) and drop the entry. 
Because `closedTimeMs` is set before `cleanupRunningQueries()` runs, and both the cache publish and `closedTimeMs` are volatile, every interleaving is covered:\n   - if we observe the session as closing, we stop and drop the query here;\n   - otherwise `close()` has not set `closedTimeMs` yet, so its `cleanupRunningQueries()` runs after our cache insertion and observes the entry we just published.\n\n   `StreamingQuery.stop()` is idempotent and `isActive`-guarded, so both sides firing is harmless.\n\n3. **Identity-based, stop-then-remove async cleanup** — when the recheck fires, we stop the query and remove the cache entry on a `Future` (since `stop()` may block). Two subtleties:\n   - **Stop before remove.** We drop the entry only *after* the query has actually stopped. Removing it first would discard the only server-side handle to a query that might still be running, re-introducing the very leak this guards against. If `stop()` throws, we leave the entry cached so `periodicMaintenance` (and `cleanupRunningQueries`) can still find and reap it once the query becomes inactive.\n   - **Match by query identity, not value equality.** Removal uses `queryCache.computeIfPresent` and only nulls the entry when `current.query eq query`. A plain `queryCache.remove(queryKey, value)` by `QueryCacheValue` case-class equality would *fail to remove* the entry if the maintenance thread had concurrently rewritten its `expiresAtMs` after observing the just-stopped query (the value no longer equals the one we inserted). Identity matching removes our entry regardless of such rewrites, while still never evicting a later replacement registered for the same key.\n\n4. **`isActive` check at insertion time** — the new entry\u0027s `expiresAtMs` is now derived from `query.isActive`. A query that is already inactive at registration time (e.g. 
a `Trigger.AvailableNow` query that already finished, or one stopped right after `start()`) gets an expiry time immediately, instead of lingering as a falsely \"active\" entry until a later maintenance cycle notices it stopped.\n\n#### 2. `StreamingForeachBatchHelper.CleanerCache` (the foreachBatch runner cache)\n\nThe same shutdown window applies to the Python `foreachBatch` runner cleaner, which is registered (via `registerCleanerForQuery`) immediately after the query is registered. `close()` reaps runners through `CleanerCache.cleanUpAll()`, and the `onQueryTerminated` listener is the other reaper. A cleaner registered for a query started concurrently with `close()` can be missed by `cleanUpAll()`, and if its query already terminated it can also be missed by the listener — stranding a Python worker, the same class of leak as the query case. This PR applies a symmetric guard:\n\n1. **Pre-insert fast path** — if `sessionHolder.isClosing` is already true on entry, we close the runner immediately and return without registering anything (in particular without adding the listener; see below).\n\n2. **Post-insert recheck** — after inserting the cleaner, we re-check `isClosing` and, if the session started closing in the meantime, clean the runner up here and drop the listener.\n\n3. **Listener-lifecycle rework** — the runner clean-up listener was a `lazy val` initialized on first query and never removed. Crucially, `SessionHolder.close()` does **not** remove this listener (it is not tracked in the session\u0027s `listenerCache`), so a listener left registered keeps the `CleanerCache` / `SessionHolder` reachable after the session is closed. 
It is now a `var` managed by `ensureListenerRegistered()` / `removeListenerIfRegistered()`, both guarded by `this`:\n   - `cleanUpAll()` removes the listener after reaping the runners;\n   - the post-insert recheck removes it too (the recheck may have just re-added it after `cleanUpAll()` ran);\n   - the field is **recoverable** — a later registration re-adds a fresh listener, so `cleanUpAll()` does not permanently disable the cache if it is ever reused. Today `cleanUpAll()` is only called on the close path (after which registration fast-paths on `isClosing`), but correctness no longer depends on that.\n   - `listenerForTesting` now reads the field under the same lock, so concurrent tests see a consistent value rather than a stale/torn read.\n\n### Why are the changes needed?\n\nThe race strands streaming queries: they keep running, hold a reference to a closed session, and are never reaped. In production this manifested as Connect structured-streaming jobs in multi-task workflows that complete their computation but never exit — the driver sits idle (0% CPU) for many hours until a manual cancellation or a much longer infra-level cleanup timeout, incurring significant unnecessary cost. The foreachBatch-cleaner guard closes the symmetric leak for Python `foreachBatch` workers. Fixing the registration/shutdown race lets the session-close path reliably account for every query and runner so the driver can terminate normally.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. This fixes an internal resource-cleanup race in the Spark Connect server. 
There is no API or behavior change visible to users other than streaming queries (and their Python `foreachBatch` workers) no longer being stranded when a session is closed concurrently with a query starting.\n\n### How was this patch tested?\n\nAdded 7 unit tests across two suites.\n\n`SparkConnectStreamingQueryCacheSuite`:\n- `\"Query registered when the session is already closing is stopped and dropped\"`\n- `\"Query registered for a closing session is retained when stopping it fails\"`\n- `\"Query registration racing with session shutdown leaves no query running\"` (200-iteration race test)\n\n`StreamingForeachBatchHelperSuite`:\n- `\"CleanerCache: a runner registered for a closing session is cleaned up immediately\"`\n- `\"CleanerCache.cleanUpAll unregisters the streaming listener\"`\n- `\"CleanerCache: listener is recoverable -- re-registered after cleanUpAll\"`\n- `\"CleanerCache: registration racing with session shutdown strands no runner or listener\"` (200-iteration race test)\n\nThe existing happy-path tests continue to pass with the `isActive`-based expiry change (an active query is still cached with no expiry).\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Claude Opus 4.8)\n\nCloses #56377 from dbtsai/spark-connect-fix.\n\nLead-authored-by: DB Tsai \u003cdbtsai@dbtsai.com\u003e\nCo-authored-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\nSigned-off-by: DB Tsai \u003cdbtsai@dbtsai.com\u003e\n"
    },
    {
      "commit": "bd046363da14d2235e20817c40dc949e99298c75",
      "tree": "fd936f4b8d6a97c93f4c2ccbc9af656b0c3fae12",
      "parents": [
        "0b0535233e7b97bf1b3a786cf866b77933e6fde0"
      ],
      "author": {
        "name": "Haiyang Sun",
        "email": "haiyang.sun@databricks.com",
        "time": "Wed Jun 10 08:46:18 2026 -0400"
      },
      "committer": {
        "name": "Herman van Hövell",
        "email": "herman@databricks.com",
        "time": "Wed Jun 10 08:46:18 2026 -0400"
      },
      "message": "[SPARK-57318][SQL] Refactor WorkerSession into a state-machine interface\n\n### What changes were proposed in this pull request?\n\nThis PR refactors the experimental UDF worker **session interface** in the `udf-worker-core` module into an explicit state machine with well-defined terminal outcomes, so that a concrete transport backend can plug in cleanly. (The first such backend -- a gRPC-over-UDS dispatcher -- lands in a follow-up PR.)\n\nCore interface changes:\n\n- **`WorkerSession`** becomes a single `AtomicReference`-driven state machine (`Created -\u003e Initializing -\u003e Initialized -\u003e Streaming -\u003e Finishing`, with `Cancelling`/`Finished`/`Cancelled`/`Failed`/`TransportFailed` terminals). `init` now returns an `InitResponse`, `process` takes a `finish` thunk, and `close` returns a `Termination` instead of throwing; the previous separate `cancel()` is folded into `close()`. Subclasses implement `doInit` / `doProcess` / `doClose` hooks and drive transitions via protected helpers.\n- **`Termination`** (new) is a sealed trait modeling the four terminal outcomes: `Finished(FinishResponse)`, `Cancelled(CancelResponse)`, `Failed(ExecutionError)`, and `TransportFailed(Throwable)`.\n- **`WorkerHandle`** (new) is a small trait that decouples a session from the concrete worker-provisioning model. `DirectWorkerProcess` implements it and gains a cleanup-hook registry, so a dispatcher can register transport-specific cleanup (e.g. deleting a socket file) without the core hard-coding it.\n- **`DirectWorkerDispatcher`** is generalized from a Unix-socket-specific class into abstract transport hooks (`newEndpointAddress` / `waitForReady` / `cleanupEndpointAddress` / `closeTransport` / `validateTransportSupport` / `newConnection` / `newSession` / `initialize`). 
The concrete Unix-socket dispatcher and session (`DirectUnixSocketWorkerDispatcher`, `DirectWorkerSession`) are removed -- the first concrete backend is provided by the follow-up gRPC change.\n- **`WorkerConnection`** becomes a trait.\n\nCaller/test changes:\n\n- `ExternalUDFExec` (sql/core) is updated to the new `close()` / `Termination` finalizer: a single task-completion listener now handles both the success path (a `FinishResponse` has already arrived) and the failure/early-stop path (send `Cancel`, await `CancelResponse`), replacing the previous separate cancel-on-failure listener.\n- The in-core test scaffolding (`TestDirectWorkerHelpers`, `DirectWorkerDispatcherSuite`) is adapted to the new API as a small UDS-backed test dispatcher, so `udf-worker-core` and `sql/core` stay green. This scaffolding moves to the `udf-worker-grpc` module alongside the concrete gRPC dispatcher in the follow-up PR.\n\nClass hierarchy after this change:\n\n```\nWorkerSession (abstract state machine; close(): Termination)\nWorkerHandle  (trait)  \u003c- DirectWorkerProcess\nDirectWorkerDispatcher (abstract transport hooks)\nWorkerConnection (trait)\nTermination \u003d Finished | Cancelled | Failed | TransportFailed\n```\n\n### Why are the changes needed?\n\nThe previous `WorkerSession` tracked its lifecycle with a pair of `AtomicBoolean`s and signaled completion/cancellation by throwing, which does not cleanly express the several terminal outcomes a worker session can reach (clean finish, cancellation, user/worker error, transport failure) nor make worker reuse decisions explicit. It was also tied to a single Unix-socket transport. Reshaping it into an explicit state machine with a `Termination` result and abstract transport hooks gives a precise, testable lifecycle contract and lets a concrete transport (gRPC over UDS, in the follow-up) implement a small set of hooks rather than re-deriving lifecycle and termination handling. 
`WorkerHandle` and the cleanup-hook registry remove the core\u0027s dependence on a specific provisioning model.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. The UDF worker framework is experimental and currently unused by any released code path; this PR only reshapes its internal interfaces.\n\n### How was this patch tested?\n\nExisting and adapted unit tests, run on top of `master`:\n\n- `build/sbt udf-worker-core/Test/compile` and `build/sbt sql/Test/compile` succeed.\n- `build/sbt \"udf-worker-core/testOnly *DirectWorkerDispatcherSuite\"` -- the dispatcher process-lifecycle suite (spawn / wait-for-ready / cleanup / error / timeout / concurrency) passes (31 tests) against the adapted in-core UDS test dispatcher.\n- `PythonUDFWorkerSpecificationSuite` (sql/core) is updated to the new `close()` signature and continues to spawn a real Python worker and verify reachability.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nYes\n\nCloses #56397 from haiyangsun-db/pr2-udf-core-session-refactor.\n\nAuthored-by: Haiyang Sun \u003chaiyang.sun@databricks.com\u003e\nSigned-off-by: Herman van Hövell \u003cherman@databricks.com\u003e\n"
    },
    {
      "commit": "0b0535233e7b97bf1b3a786cf866b77933e6fde0",
      "tree": "82063cf7a3dc807c865332c6b1e5de3996fc88b4",
      "parents": [
        "19aec7afac4d32d36b46da0186426b0a4e4490bf"
      ],
      "author": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Wed Jun 10 19:37:20 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@foxmail.com",
        "time": "Wed Jun 10 19:37:20 2026 +0800"
      },
      "message": "[MINOR][PYTHON][TEST] Use assertEqual instead of assertTrue in PySpark tests\n\n### What changes were proposed in this pull request?\nThis PR replaces misuses of `assertTrue` with `assertEqual` in PySpark test suites:\n\n1. `self.assertTrue(a \u003d\u003d b)` -\u003e `self.assertEqual(a, b)` (53 call sites across 22 files). A mechanical, behavior-preserving change.\n2. `self.assertTrue(df.columns, [...])` -\u003e `self.assertEqual(df.columns, [...])` (4 call sites in `test_dataframe.py` and `test_column.py`). These passed the expected column list as the `assertTrue` *message* argument, so they only checked that `df.columns` is truthy and never compared the names. This is a bug fix.\n\n### Why are the changes needed?\n- For `assertTrue(a \u003d\u003d b)`: when the assertion fails, the message is only `AssertionError: False is not true`, which hides the operands. `assertEqual(a, b)` reports both sides (e.g. `5 !\u003d 9`), making failures easier to diagnose.\n- For `assertTrue(df.columns, [...])`: a non-empty list is always truthy, so the expected column names were never actually verified. `assertEqual` makes these tests check what they intended to.\n\n### Does this PR introduce _any_ user-facing change?\nNo. Test-only change.\n\n### How was this patch tested?\nExisting tests. The `a \u003d\u003d b` rewrites are equivalent to the originals (same `\u003d\u003d` comparison). For the column assertions, the expected lists match the actual DataFrame columns, so the now-effective assertions pass.\n\n### Was this patch authored or co-authored using generative AI tooling?\nGenerated-by: Claude Code (model: claude-opus-4-8)\n\nCloses #56396 from zhengruifeng/pyspark-test-assertequal-dev2.\n\nAuthored-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@foxmail.com\u003e\n"
    },
    {
      "commit": "19aec7afac4d32d36b46da0186426b0a4e4490bf",
      "tree": "168f05d6db4290a747109d011475e2c99031e92d",
      "parents": [
        "cc88e6c778ddb400cf13d7cdd98307717e8e87ff"
      ],
      "author": {
        "name": "Maxim Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Wed Jun 10 13:36:28 2026 +0200"
      },
      "committer": {
        "name": "Max Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Wed Jun 10 13:36:28 2026 +0200"
      },
      "message": "[SPARK-57315][SQL] Support HOUR, MINUTE and SECOND functions over nanosecond-precision timestamps\n\n### What changes were proposed in this pull request?\n\nThis PR lets `hour()`, `minute()` and `second()` operate on the nanosecond-precision timestamp types `TIMESTAMP_NTZ(p)` / `TIMESTAMP_LTZ(p)` (`p` in `[7, 9]`), the preview types under the nanosecond timestamp umbrella (SPARK-56822).\n\nIt extends the function expression builders `HourExpressionBuilder`, `MinuteExpressionBuilder` and `SecondExpressionBuilder`. When the argument is a nanosecond-precision timestamp, the builder casts it down to the matching microsecond timestamp type and reuses the existing `Hour` / `Minute` / `Second` expressions:\n\n- `TimestampNTZNanosType(p)` -\u003e `TimestampNTZType`\n- `TimestampLTZNanosType(p)` -\u003e `TimestampType`\n\nThe cast keeps `epochMicros` and drops the sub-microsecond digits, which is lossless for these integer time-of-day fields. The conversion is centralized in a small `NanosTimestampCast.castToMicros` helper. `SecondWithFraction` (the `extract(SECOND)` path returning `DECIMAL(8, 6)`) is intentionally not covered because its result depends on the sub-microsecond digits.\n\nThis is an alternative implementation to #56366, which addresses the same JIRA with a dedicated analyzer rule (`ResolveTimestampNanosExpressions`) instead of changing the builders.\n\nCloses #56366\n\n### Why are the changes needed?\n\nTo support `HOUR` / `MINUTE` / `SECOND` over the nanosecond timestamp types as part of the nanosecond timestamp preview (SPARK-56822), reusing the existing microsecond expressions rather than introducing new ones.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes, but only when the preview flag `spark.sql.timestampNanosTypes.enabled` is set to `true` (default `false`). With the flag on, `hour`/`minute`/`second` now accept `TIMESTAMP_NTZ(p)` / `TIMESTAMP_LTZ(p)` arguments; previously they failed analysis with a type-check error. 
Behavior for microsecond `TIMESTAMP` / `TIMESTAMP_NTZ`, `DATE`, strings, and `TIME` is unchanged.\n\nExample (flag enabled):\n```sql\nSELECT hour(TIMESTAMP_NTZ \u00272018-02-14 12:58:59.123456789\u0027); -- 12\nSELECT minute(TIMESTAMP_LTZ \u00272009-07-30 12:58:59.123456789\u0027); -- 58\n```\n\n### How was this patch tested?\n\n- New end-to-end golden tests in `timestamp-ntz-nanos.sql` / `timestamp-ltz-nanos.sql` covering: full nanosecond precision (`p \u003d 9`, via typed literals), explicit precision via `::` casts (`p \u003d 7/8`), NULLs, pre-epoch values (negative-epoch path), and `TIMESTAMP_LTZ` extraction across source time zones (`Asia/Kolkata`, `UTC`) while the session zone is `America/Los_Angeles`.\n- Added runnable nanosecond-literal examples to the `hour`/`minute`/`second` function docs, validated by `ExpressionInfoSuite` (\"check outputs of expression examples\") and `ExpressionsSchemaSuite`.\n- Updated the `hour`/`minute`/`second` scaladoc in `functions.scala`.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Cursor (Claude Opus 4.8)\n\nCloses #56368 from MaxGekk/nanos-hour-builders.\n\nAuthored-by: Maxim Gekk \u003cmax.gekk@gmail.com\u003e\nSigned-off-by: Max Gekk \u003cmax.gekk@gmail.com\u003e\n"
    },
    {
      "commit": "cc88e6c778ddb400cf13d7cdd98307717e8e87ff",
      "tree": "eba6e05bac42a383faa26ed5a0d2f030cb117c25",
      "parents": [
        "3ad5b64dbca11d6ba37bfdf6904e53ee2c00fd6a"
      ],
      "author": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Wed Jun 10 19:31:19 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@foxmail.com",
        "time": "Wed Jun 10 19:31:19 2026 +0800"
      },
      "message": "[SPARK-57367][PYTHON][DOC] Improve See Also cross-references in pyspark.sql.functions\n\n### What changes were proposed in this pull request?\n\nImprove `See Also` cross-references in `pyspark/sql/functions/builtin.py` across three categories:\n\n**Fix broken links** (referenced functions that don\u0027t exist):\n- `var_samp`: `std_samp` → `stddev_samp`\n- `var_pop`: `std_pop` → `stddev_pop`\n\n**Add new `See Also` sections** (functions that had none):\n\n| Category | Functions |\n|---|---|\n| Math aliases | `ceil` ↔ `ceiling`, `sign` ↔ `signum`, `log` ↔ `ln` |\n| String aliases | `lcase`, `ucase`, `length` / `char_length` / `character_length`, `printf` |\n| Try variants | `to_binary` ↔ `try_to_binary`, `to_number` ↔ `try_to_number` |\n| Aggregates | `avg`, `sum`, `median`, `count_distinct`, `uniform` |\n\n**Add missing symmetric cross-references** to existing `See Also` sections:\n\n| Function | Added |\n|---|---|\n| `lower`, `upper` | `lcase`, `ucase` (alias back-links) |\n| `variance` | `std` |\n| `getbit` | `bit_count` |\n| `get` | `try_element_at` |\n| `substr` | `locate` |\n| `day` | `weekday` |\n| `dayofyear` | `dayofweek` |\n| `weekday` | `dayofweek`, `dayofyear`, `dayofmonth` |\n| `date_add`, `dateadd` | `add_months` |\n| `date_diff`, `timestamp_diff` | `time_diff` |\n| `to_date` ↔ `try_to_date` | each other |\n| `to_timestamp_ltz`, `to_timestamp_ntz` | `try_to_timestamp` |\n| `try_to_timestamp` | `try_to_date`, `try_to_time` |\n| `bitmap_construct_agg` | `bitmap_and_agg` |\n\n### Why are the changes needed?\n\nThe broken links produce dead references in the generated API docs (e.g., https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.var_samp.html). The missing cross-references make it harder for users to discover related functions when browsing the API documentation.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. 
Documentation-only changes.\n\n### How was this patch tested?\n\nVerified programmatically that all referenced function names resolve to actual definitions or re-exported names in `pyspark.sql.functions`.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Sonnet 4.6\n\nCloses #56404 from zhengruifeng/pyspark-see-also-dev4.\n\nAuthored-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@foxmail.com\u003e\n"
    },
    {
      "commit": "3ad5b64dbca11d6ba37bfdf6904e53ee2c00fd6a",
      "tree": "c33a406eebad9d6701a464579dd4bcf87a90d848",
      "parents": [
        "bbd7c46919a4546e4921acb6e64495a7d966bfb2"
      ],
      "author": {
        "name": "Shrirang Mhalgi",
        "email": "shrirangmhalgi@gmail.com",
        "time": "Wed Jun 10 11:57:36 2026 +0200"
      },
      "committer": {
        "name": "Peter Toth",
        "email": "peter.toth@gmail.com",
        "time": "Wed Jun 10 11:57:36 2026 +0200"
      },
      "message": "[SPARK-57194][SQL] Add preOperatorOptimizationRules extension point to Optimizer\n\n### What changes were proposed in this pull request?\nAdd `preOperatorOptimizationRules` - a new optimizer extension point that runs in a `Once` batch between \"Aggregate\" and the main fixed-point \"Operator Optimization\" batch. Wired through `SparkSessionExtensions.injectPreOperatorOptimizationRule` and `BaseSessionStateBuilder`. The batch is a no-op when no rules are registered.\n\n### Why are the changes needed?\nCustom rules injected via `injectOptimizerRule` run inside the fixed-point batch alongside built-in rules (FoldablePropagation, ConstantFolding, PushDownPredicates). A rule that needs to observe the original plan shape (e.g., cross-side join predicates before they are folded into single-side constants) is silently defeated when a built-in rule transforms the plan first within the same fixed-point iteration.\n\nNone of the existing extension points cover this:\n\n- `extendedOperatorOptimizationRules` - same fixed-point batch\n- `postHocResolutionRules` - analyzer phase, too early\n- `earlyScanPushDownRules` - runs after optimization, scoped to scan pushdown\n- `preCBORules` - runs after operator optimization, too late\n\n### Does this PR introduce _any_ user-facing change?\nYes. New public API will be available: `SparkSessionExtensions.injectPreOperatorOptimizationRule(builder: RuleBuilder)`.\n\n### How was this patch tested?\n- New test in `SparkSessionExtensionSuite` verifying the injected rule is wired into the `preOperatorOptimizationRules` accessor.\n- All existing SparkSessionExtensionSuite tests pass.\n\n### Was this patch authored or co-authored using generative AI tooling?\nYes.\n\nCloses #56277 from shrirangmhalgi/SPARK-57194-early-optimizer-rules.\n\nAuthored-by: Shrirang Mhalgi \u003cshrirangmhalgi@gmail.com\u003e\nSigned-off-by: Peter Toth \u003cpeter.toth@gmail.com\u003e\n"
    },
    {
      "commit": "bbd7c46919a4546e4921acb6e64495a7d966bfb2",
      "tree": "12fa515cbe10a8e9a06509fbffffb1ff785f8bb2",
      "parents": [
        "6d4b71ef52b9b08bcfd56dbce1ac913a3c8bf2e7"
      ],
      "author": {
        "name": "Cheng Pan",
        "email": "pan3793@gmail.com",
        "time": "Wed Jun 10 16:13:35 2026 +0800"
      },
      "committer": {
        "name": "Cheng Pan",
        "email": "chengpan@apache.org",
        "time": "Wed Jun 10 16:13:35 2026 +0800"
      },
      "message": "[SPARK-57261][SQL] Allow to disable HashAggregateExec by config\n\n### What changes were proposed in this pull request?\n\nCurrently, Spark always prefers to use `HashAggregateExec` over `SortAggregateExec` if possible, this PR adds a config `spark.sql.execution.useHashAggregateExec` to allow users to disable `HashAggregateExec` explicitly.\n\n### Why are the changes needed?\n\nWe found some jobs fail with `HashAggregateExec` due to OOM (auto fallback logic does not work well), and it runs well with `SortAggregateExec`\n\n```\n26/06/04 18:47:30 ERROR [SIGTERM handler] CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM\n26/06/04 18:47:30 WARN [Executor task launch worker for task 9749.0 in stage 14.0 (TID 61758)] TaskMemoryManager: Failed to allocate a page (2147483648 bytes) for 0 times, try again.\njava.lang.OutOfMemoryError: Java heap space\n\tat org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:72)\n\tat org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:398)\n\tat org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:359)\n\tat org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:96)\n\tat org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap.java:868)\n\tat org.apache.spark.unsafe.map.BytesToBytesMap.growAndRehash(BytesToBytesMap.java:991)\n\tat org.apache.spark.unsafe.map.BytesToBytesMap$Location.append(BytesToBytesMap.java:817)\n\tat org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.getAggregationBufferFromUnsafeRow(UnsafeFixedWidthAggregationMap.java:135)\n\tat org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.hashAgg_doConsume_0$(Unknown Source)\n\tat org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.hashAgg_doAggregateWithKeys_0$(Unknown Source)\n\tat 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.processNext(Unknown Source)\n\tat org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:44)\n\tat org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)\n\tat scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:593)\n\tat org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:195)\n\tat org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:57)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:111)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)\n\tat org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:206)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:147)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:900)\n\tat org.apache.spark.executor.Executor$TaskRunner$$Lambda$709/0x00007f84474fd558.apply(Unknown Source)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:86)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:83)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:903)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n```\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nUT is tuned.\n\nAlso verified with a production case, `HashAggregateExec` vs `SortAggregateExec`\n\n\u003cimg width\u003d\"1764\" height\u003d\"346\" alt\u003d\"Xnip2026-06-04_21-23-43\" 
src\u003d\"https://github.com/user-attachments/assets/401d3ec2-2bdc-4be3-bc1c-9ab2c2543e1f\" /\u003e\n\n\u003cimg width\u003d\"1756\" height\u003d\"167\" alt\u003d\"Xnip2026-06-04_21-22-38\" src\u003d\"https://github.com/user-attachments/assets/84146634-1e71-4887-ad4b-e374df112441\" /\u003e\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #56323 from pan3793/SPARK-57261.\n\nAuthored-by: Cheng Pan \u003cpan3793@gmail.com\u003e\nSigned-off-by: Cheng Pan \u003cchengpan@apache.org\u003e\n"
    },
    {
      "commit": "6d4b71ef52b9b08bcfd56dbce1ac913a3c8bf2e7",
      "tree": "30020859d7ca23f91fe90b5ac74fa24f1197b5bc",
      "parents": [
        "e598b0c09d7b1a2982f5a60b34f7904cffe2c053"
      ],
      "author": {
        "name": "YangJie",
        "email": "yangjie01@baidu.com",
        "time": "Wed Jun 10 16:02:14 2026 +0800"
      },
      "committer": {
        "name": "yangjie01",
        "email": "yangjie01@baidu.com",
        "time": "Wed Jun 10 16:02:14 2026 +0800"
      },
      "message": "[SPARK-57344][INFRA] Ensure tests for `pipelines` module triggered when sql-related modules are modified\n\n### What changes were proposed in this pull request?\n\nThis PR marks `pipelines` as depending on the `sql` module in Spark\u0027s test module graph, and updates the existing doctest expectations for the affected module closure.\n\n### Why are the changes needed?\n\nThe `pipelines` project depends on Spark SQL, but the test module graph treated it as independent. As a result, changes in SQL-related modules such as `sql`, `catalyst`, and `sql-api` did not select `pipelines/test` in the affected-module CI path.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\n```bash\npython3 -m py_compile dev/sparktestsupport/modules.py dev/sparktestsupport/utils.py\nPYTHONPATH\u003ddev python3 -m doctest dev/sparktestsupport/utils.py\ngit diff --check\n```\n\nAlso verified locally that changes in `core`, `sql-api`, `catalyst`, `sql`, and `sql/pipelines` include the `pipelines` module in the affected-module selection.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: OpenAI Codex\n\nCloses #56405 from LuciferYang/fix-pipelines-module-deps.\n\nAuthored-by: YangJie \u003cyangjie01@baidu.com\u003e\nSigned-off-by: yangjie01 \u003cyangjie01@baidu.com\u003e\n"
    },
    {
      "commit": "e598b0c09d7b1a2982f5a60b34f7904cffe2c053",
      "tree": "4b82ab954340ed028187c141d98d353ea93b82d2",
      "parents": [
        "62ae4db28f3be8a0ca2c3016d27ca5a62f02915d"
      ],
      "author": {
        "name": "Kousuke Saruta",
        "email": "sarutak@amazon.co.jp",
        "time": "Wed Jun 10 17:01:49 2026 +0900"
      },
      "committer": {
        "name": "Kousuke Saruta",
        "email": "sarutak@apache.org",
        "time": "Wed Jun 10 17:01:49 2026 +0900"
      },
      "message": "[SPARK-57348][PYTHON][TESTS] Replace sql_keywords doctest show() with columns check\n\n### What changes were proposed in this pull request?\n\nReplace the `show()`-based doctest for `sql_keywords()` with a `.columns` check.\n\n### Why are the changes needed?\n\nSPARK-57133 (#56247) added 7 new non-reserved keywords (BIN, WIDTH, ALIGN, etc.), which changed the top-20 row output of `sql_keywords().show()` and consequently the column width in the formatted output. This broke the `pyspark-connect-old-client` CI job, which runs `branch-4.0` client tests against a `master` server. The `branch-4.0` doctest still expects the old column width.\n\nhttps://github.com/sarutak/spark/actions/runs/27188469096/job/80265105548\n\n```\n**********************************************************************\nFile \"/__w/spark/spark-4.0/python/pyspark/sql/connect/tvf.py\", line ?, in pyspark.sql.connect.tvf.TableValuedFunction.sql_keywords\nFailed example:\n    spark.tvf.sql_keywords().show()\nExpected:\n    +-------------+--------+\n    |      keyword|reserved|\n    +-------------+--------+\n    ...\n    +-------------+--------+...\nGot:\n    +----------+--------+\n    |   keyword|reserved|\n    +----------+--------+\n    |       ADD|   false|\n    |     AFTER|   false|\n    | AGGREGATE|   false|\n    |     ALIGN|   false|\n    |       ALL|   false|\n    |     ALTER|   false|\n    |    ALWAYS|   false|\n    |   ANALYZE|   false|\n    |       AND|   false|\n    |      ANTI|   false|\n    |       ANY|   false|\n    | ANY_VALUE|   false|\n    |    APPROX|   false|\n    |   ARCHIVE|   false|\n    |     ARRAY|   false|\n    |        AS|   false|\n    |       ASC|   false|\n    |ASENSITIVE|   false|\n    |        AT|   false|\n    |    ATOMIC|   false|\n    +----------+--------+\n    only showing top 20 rows\n**********************************************************************\n   1 of   1 in pyspark.sql.connect.tvf.TableValuedFunction.sql_keywords\n***Test Failed*** 1 
failures.\n```\n\nThe `show()` output is inherently fragile for this TVF because any keyword addition changes the formatting. Since a dedicated unittest (`test_sql_keywords` in `test_tvf.py`) already verifies the full output via `assertDataFrameEqual`, the doctest only needs to confirm that the method returns a valid DataFrame. Using `.columns` achieves this without being sensitive to keyword list changes.\n\nCurrently the only branch affected in CI is `branch-4.0` (via `pyspark-connect-old-client`), but this change is made on `master` for consistency and will be backported to older branches.\n\n### Does this PR introduce *any* user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nExisting `test_sql_keywords` unittest continues to pass.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nKiro CLI / Claude\n\nCloses #56406 from sarutak/fix-sql-keywords-doctest.\n\nAuthored-by: Kousuke Saruta \u003csarutak@amazon.co.jp\u003e\nSigned-off-by: Kousuke Saruta \u003csarutak@apache.org\u003e\n"
    },
    {
      "commit": "62ae4db28f3be8a0ca2c3016d27ca5a62f02915d",
      "tree": "0f12a95dfdc1f3cf0b3280453123af9a6aa3c195",
      "parents": [
        "844f6f0dbdcd85c804b87cf08585d6f3b626f621"
      ],
      "author": {
        "name": "AnishMahto",
        "email": "anish.mahto99@gmail.com",
        "time": "Tue Jun 09 23:45:07 2026 -0700"
      },
      "committer": {
        "name": "Jose Torres",
        "email": "jtorres@apache.org",
        "time": "Tue Jun 09 23:45:07 2026 -0700"
      },
      "message": "[SPARK-57152][SDP] Implement SCD2 Batch Processor; Find Affected Aux/Target Table Rows\n\nApproved AutoCDC SPIP: https://lists.apache.org/thread/j6sj9wo9odgdpgzlxtvhoy7szs0jplf7\n\n--------\n\n### What changes were proposed in this pull request?\n**Preamble**:\n\nThe SCD type 2 flow is a foreachBatch streaming query on an input change-data-feed, and is responsible for reconciling the incoming change data onto some target table that follows SCD2 replication semantics.\n\nSCD2 flows also maintain an \"auxiliary\" table to keep track of early-arriving out-of-order received events state. Each microbatch will need to reconcile against this auxiliary table as well, and update the auxiliary table\u0027s state appropriately for future microbatches.\n\n**Find Affected Aux/Target Table Rows**\n\nAfter preprocessing the microbatch such that we have each incoming row\u0027s startAt, endAt, and recordStartAt projected, the next step in reconciliation is determining which existing rows in the auxiliary and target tables either might be affected by the incoming rows or they might affect the incoming rows themselves.\n\nA no-op upsert run row in the auxiliary table can be affected by the microbatch if an incoming row makes the row no longer a no-op (i.e microbatch delivers an interleaving row that does indeed change history tracked columns). A tombstone in the auxiliary table can affect an incoming row if it now matches against an upsert in the microbatch.\n\nA row in the target table can be affected by the microbatch if an incoming upsert makes the target table\u0027s row a no-op upsert, or an incoming delete/upsert event terminates an existing row in the target table. An active row (endAt\u003dnull) in the target table could become terminated, or an existing closed row in the target table could become bisected. 
Conversely, existing rows in the target table can dictate when an incoming upsert row should be considered closed from.\n\nWe take a practical, conservative approach in selecting the set of rows that could possibly be affected or affect the microbatch. Per key we retrieve all existing rows whose startAt comes after the youngest sequence in the incoming microbatch, as well as the first existing row in both the target/aux that comes before the youngest sequence.\n\nThis is opposed to doing a very complex and expensive join to determine which rows are definitively affected by/affecting the microbatch. In practice it\u0027s not common to receive very old events out of order, so pulling in all existing rows that come after the oldest row in the microbatch will generally be a very small result set.\n\n### Why are the changes needed?\nAutoCDC SCD2 core algorithm.\n\n### Does this PR introduce _any_ user-facing change?\nNo, new feature.\n\n### How was this patch tested?\nUnit tested in `Scd2BatchProcessorSuite`.\n\n### Was this patch authored or co-authored using generative AI tooling?\nCo-authored with Claude Opus 4.7.\n\nCloses #56283 from AnishMahto/SPARK-57152-SCD2-find-affected-rows.\n\nAuthored-by: AnishMahto \u003canish.mahto99@gmail.com\u003e\nSigned-off-by: Jose Torres \u003cjtorres@apache.org\u003e\n"
    },
    {
      "commit": "844f6f0dbdcd85c804b87cf08585d6f3b626f621",
      "tree": "0198d087d4c9e7012c9a678e982bbcea139dc042",
      "parents": [
        "5001ba0b99698ac133bb71b9e43a31665ab45046"
      ],
      "author": {
        "name": "Maxim Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Wed Jun 10 08:37:26 2026 +0200"
      },
      "committer": {
        "name": "Max Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Wed Jun 10 08:37:26 2026 +0200"
      },
      "message": "[SPARK-57338][SQL] Render external values in Row JSON via formatExternal\n\n### What changes were proposed in this pull request?\n\nThis PR fixes `Row.json` / `Row.prettyJson` for `TIME` (`TimeType`) columns by rendering the row\u0027s external value through the Types Framework\u0027s `formatExternal` instead of the internal-value `format`.\n\n- `Row.jsonValue`\u0027s `toJson` now renders a framework type\u0027s value via `TypeApiOps(dt).formatExternal(value)`. A framework type either returns a rendered string or raises its own error; types outside the framework fall back to the legacy `toJsonDefault`.\n- `TimeTypeApiOps.formatExternal` now formats the external `java.time.LocalTime` directly through the time formatter. It previously converted to nanos-of-day first; that was an identity round-trip (`localTimeToNanos` then `nanosToLocalTime`), so the rendered output is unchanged.\n- The nanosecond timestamp ops override the single-arg `formatExternal(value)` to raise `UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING` directly (the same error their `format` raises), so Row JSON keeps reporting that the rendering is unsupported instead of mis-rendering the external value. With that, `Row.toJson` no longer needs to fall back to the internal-value `format`.\n- The nanosecond timestamp ops also override the two-arg `formatExternal(value, nested)` to return `None`. That overload is the entry point used by `HiveResult.toHiveString`, which is zone-aware and renders nanosecond timestamps through its own default formatter; returning `None` lets it fall through to that path instead of inheriting the single-arg overload\u0027s unsupported-rendering error. Without this override the Hive/SQL output path would regress (it previously relied on `formatExternal` returning `None`). 
**This is a temporary split until nanos external rendering is unified across the zone-less (Row JSON) and zone-aware (Hive) paths.**\n- `toJsonDefault` now also handles the external `java.time.LocalTime` for a `TimeType` column, so `Row.json` renders a TIME column even with the Types Framework off (the production default), where `TypeApiOps` returns `None` and the value falls into the legacy path. This mirrors `HiveResult.toHiveStringDefault`\u0027s existing legacy fallback; previously this path threw `FAILED_ROW_TO_JSON`.\n- Updated the `TypeApiOps` trait scaladoc for the two `formatExternal` overloads to name both consumers (single-arg \u003d Row JSON, two-arg \u003d `HiveResult`) and define the `Some` / `None` / throw contract.\n\nRoot cause: a public `Row` holds *external* values (e.g. `java.time.LocalTime` for `TimeType`), but `toJson` called the internal-value `format`, which does `value.asInstanceOf[Long]` and threw `ClassCastException`. On the framework-off legacy path the external `LocalTime` instead hit the `toJsonDefault` fallthrough and threw `FAILED_ROW_TO_JSON`.\n\n### Why are the changes needed?\n\nWith the Types Framework enabled, serializing a row that has a `TIME` column to JSON fails:\n\n```scala\nimport java.time.LocalTime\nimport org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema\nimport org.apache.spark.sql.types._\n\nval row \u003d new GenericRowWithSchema(\n  Array(LocalTime.of(12, 13, 14)),\n  new StructType().add(\"a\", TimeType()))\n\nrow.json\n```\n\nthrows:\n\n```\njava.lang.ClassCastException: class java.time.LocalTime cannot be cast to class java.lang.Long\n```\n\nbecause the JSON path passed the external `LocalTime` to the internal-value `format`. 
The framework already exposes `formatExternal` (the external-value entry point, as `HiveResult` uses); Row JSON should use it.\n\nThe same `Row.json`-on-`TIME` failure also exists with the Types Framework off (the production default): there `TypeApiOps` returns `None`, the external `LocalTime` falls into `toJsonDefault`, which only matched the internal `Long`, and the row failed with `FAILED_ROW_TO_JSON` - even though `HiveResult` rendered the same value fine via its legacy fallback. This PR fixes both paths.\n\nThis also closes the coverage gap noted in #56355: `Row.jsonValue` was not exercised for framework types.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes. `Row.json` / `Row.prettyJson` on a `TIME` column previously failed (with `ClassCastException` under the Types Framework, or `FAILED_ROW_TO_JSON` with it off); it now renders the value regardless of the flag, e.g.:\n\n```\n{\"a\":\"12:13:14\"}\n```\n\nThe nanosecond timestamp types are unaffected: Row JSON keeps raising `UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING` (via the single-arg `formatExternal`), and the Hive/SQL output path keeps rendering nanosecond timestamps as before (via the two-arg `formatExternal` returning `None`).\n\n### How was this patch tested?\n\nAdded `RowJsonSuite` tests (tagged `SPARK-57338`):\n- `TIME` column rendering: plain, fractional (microsecond resolution, trailing zeros trimmed), and midnight.\n- Rendering is independent of the `TimeType` precision.\n- A null `TIME` column renders as JSON `null`.\n- `TIME` column rendering on the legacy path with the Types Framework disabled (plain, fractional, and null).\n- A guard that nanosecond timestamp columns still raise `UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING` in Row JSON.\n\nI also confirmed the new `TIME` test fails on the pre-fix code with the expected `ClassCastException`, and re-ran the affected suites (`RowJsonSuite`, `HiveResultSuite`, and `SQLQueryTestSuite` for `timestamp-ltz-nanos.sql` / 
`timestamp-ntz-nanos.sql`) to confirm the Hive/SQL nanos rendering path stays green. `./dev/scalastyle` passes.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Cursor (Claude Opus 4.8)\n\nCloses #56392 from MaxGekk/row-json-time-formatexternal.\n\nAuthored-by: Maxim Gekk \u003cmax.gekk@gmail.com\u003e\nSigned-off-by: Max Gekk \u003cmax.gekk@gmail.com\u003e\n"
    },
    {
      "commit": "5001ba0b99698ac133bb71b9e43a31665ab45046",
      "tree": "fe0a4e2005e2641822dc61a889ba13e9c9a88c29",
      "parents": [
        "fc527bc70dad94dd89515204fd74aafc3ad9f3de"
      ],
      "author": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Wed Jun 10 09:09:51 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@foxmail.com",
        "time": "Wed Jun 10 09:09:51 2026 +0800"
      },
      "message": "[MINOR][PYTHON][DOC] Fix broken See Also links in pyspark.sql.functions\n\n### What changes were proposed in this pull request?\n\nFix two broken `See Also` cross-references in `pyspark/sql/functions/builtin.py`:\n\n- `var_samp`: `std_samp` → `stddev_samp`\n- `var_pop`: `std_pop` → `stddev_pop`\n\nThe functions `std_samp` and `std_pop` do not exist in `pyspark.sql.functions`. The correct names are `stddev_samp` and `stddev_pop`, which are the sample and population standard deviation functions respectively.\n\n### Why are the changes needed?\n\nThe broken links produce dead references in the generated API documentation. For example, the current published docs for `var_samp` show a broken `See Also` entry:\nhttps://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.var_samp.html?highlight\u003dvar_samp#pyspark.sql.functions.var_samp\n\n\u003cimg width\u003d\"1089\" height\u003d\"243\" alt\u003d\"image\" src\u003d\"https://github.com/user-attachments/assets/9b9df766-8688-4427-82aa-24b36629069f\" /\u003e\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. Documentation-only fix.\n\n### How was this patch tested?\n\nNo tests needed for doc-only fixes. Verified programmatically that `stddev_samp` and `stddev_pop` are valid function names defined in `builtin.py`.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Sonnet 4.6\n\nCloses #56394 from zhengruifeng/fix-pyspark-broken-seealso.\n\nAuthored-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@foxmail.com\u003e\n"
    },
    {
      "commit": "fc527bc70dad94dd89515204fd74aafc3ad9f3de",
      "tree": "29d9dd413c3267b8841bf0e0ae13d453f08cd6fb",
      "parents": [
        "af39d9580efd675aecf8965b443bb404fabd1281"
      ],
      "author": {
        "name": "Thang Long Vu",
        "email": "long.vu@databricks.com",
        "time": "Tue Jun 09 17:20:53 2026 -0700"
      },
      "committer": {
        "name": "Szehon Ho",
        "email": "szehon.apache@gmail.com",
        "time": "Tue Jun 09 17:20:53 2026 -0700"
      },
      "message": "[SPARK-57316][DOC] Document WITH SCHEMA EVOLUTION and BY NAME for SQL INSERT\n\n### What changes were proposed in this pull request?\n\n[SPARK-54971](https://issues.apache.org/jira/browse/SPARK-54971) (#53732) added the `WITH SCHEMA EVOLUTION` syntax to the SQL `INSERT` command, but did not add user-facing documentation for it. This PR documents the new clause on the `INSERT TABLE` SQL reference page (`docs/sql-ref-syntax-dml-insert-table.md`):\n\n- Adds the optional `[ WITH SCHEMA EVOLUTION ]` clause to both syntax forms, mirroring the grammar where the clause appears right after `INSERT` and before `INTO` / `OVERWRITE`.\n- Adds a `WITH SCHEMA EVOLUTION` entry to the Parameters section describing the behavior.\n- Adds a `BY NAME` entry to the Parameters section noting that columns and nested fields are matched by position by default, or by name when `BY NAME` is specified.\n- Documents `BY NAME` support for `INSERT INTO ... REPLACE WHERE` (added in #53567): updates the syntax form to `INSERT [ WITH SCHEMA EVOLUTION ] INTO [ TABLE ] table_identifier [ BY NAME ] REPLACE WHERE boolean_expression query` and adds an \"Insert By Name Using a REPLACE WHERE Statement\" example.\n\n### Why are the changes needed?\n\nThe `WITH SCHEMA EVOLUTION` syntax for SQL `INSERT` was introduced without documentation. This page is the reference for the `INSERT` statement, so the new clause should be documented there.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. Documentation-only update.\n\n### How was this patch tested?\n\nDocs-only change. Reviewed the rendered Markdown for correctness.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Anthropic)\n\nCloses #56370 from longvu-db/insert-schema-evolution-docs.\n\nAuthored-by: Thang Long Vu \u003clong.vu@databricks.com\u003e\nSigned-off-by: Szehon Ho \u003cszehon.apache@gmail.com\u003e\n"
    },
    {
      "commit": "af39d9580efd675aecf8965b443bb404fabd1281",
      "tree": "479a3c5969ce8f35917b171a4a92d2892da71fad",
      "parents": [
        "2bb8b200044d96adf13565ef493736e39ce13f4e"
      ],
      "author": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Tue Jun 09 15:28:10 2026 -0700"
      },
      "committer": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Tue Jun 09 15:28:10 2026 -0700"
      },
      "message": "Revert \"[SPARK-57133][SQL] Add BIN BY relation operator parsing and resolution\"\n\n### What changes were proposed in this pull request?\n\nThis reverts commit 761afcb676189a4ef58439441da03ba73aed1e21 to recover CI.\n\nWe can land 761afcb676189a4ef58439441da03ba73aed1e21 back after merging the following PR first.\n- https://github.com/apache/spark/pull/56406\n\nPlease see the CI failure details in the above PR.\n\n### Why are the changes needed?\n\nTo recover the CIs. The original PR is not a breaking change, but it was a CI-breaking commit unfortunately.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo, this is not released yet.\n\n### How was this patch tested?\n\nPass the CIs.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #56413 from dongjoon-hyun/SPARK-57133.\n\nAuthored-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\nSigned-off-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\n"
    },
    {
      "commit": "2bb8b200044d96adf13565ef493736e39ce13f4e",
      "tree": "2b36d8bdfad33eaf93511472ec52a117ac45f294",
      "parents": [
        "6c2325a0cde405246b80d77b407a0d487b0fb52f"
      ],
      "author": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Tue Jun 09 14:24:32 2026 -0700"
      },
      "committer": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Tue Jun 09 14:24:32 2026 -0700"
      },
      "message": "[SPARK-57351][K8S][CORE] Enable `spark.kubernetes.executor.useDriverPodIP` by default\n\n### What changes were proposed in this pull request?\n\nThis PR sets the default value of `spark.kubernetes.executor.useDriverPodIP` to `true` at Apache Spark 4.3.0.\n\n### Why are the changes needed?\n\nIntroduced in SPARK-53944 (4.1.0), this option lets executor pods connect to the driver pod IP directly instead of the driver\u0027s Kubernetes `Service`, bypassing the known K8s DNS issue. Since a Spark driver pod is not restarted during an application, its IP is stable, making this a safe default that avoids DNS lookups.\n- https://github.com/apache/spark/pull/52650\n\n### Does this PR introduce _any_ user-facing change?\n\nYes. By default, executor pods now connect to the driver via the driver pod IP instead of the driver `Service` DNS name. To restore the legacy behavior, set `spark.kubernetes.executor.useDriverPodIP` to `false`.\n\n### How was this patch tested?\n\nPass the CIs.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Opus 4.8)\n\nCloses #56412 from dongjoon-hyun/SPARK-57351.\n\nAuthored-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\nSigned-off-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\n"
    },
    {
      "commit": "6c2325a0cde405246b80d77b407a0d487b0fb52f",
      "tree": "a6503dbca36795450bf49010efc0cc17c02e5edf",
      "parents": [
        "522736869aae4da5a8a1c6605b15d7d6f51f0d6c"
      ],
      "author": {
        "name": "Yicong Huang",
        "email": "17627829+Yicong-Huang@users.noreply.github.com",
        "time": "Tue Jun 09 20:52:40 2026 +0000"
      },
      "committer": {
        "name": "Yicong-Huang",
        "email": "17627829+Yicong-Huang@users.noreply.github.com",
        "time": "Tue Jun 09 20:52:40 2026 +0000"
      },
      "message": "[SPARK-56758][PYTHON] Refactor SQL_MAP_PANDAS_ITER_UDF\n\n### What changes were proposed in this pull request?\n\nConsolidate the `SQL_MAP_PANDAS_ITER_UDF` (`mapInPandas`) execution path so that input transformation, UDF invocation, result verification, and output transformation live in one block in `read_udfs()`.\n\n### Why are the changes needed?\n\nPart of [SPARK-55388](https://issues.apache.org/jira/browse/SPARK-55388). The full data flow for `mapInPandas` is now visible in one place.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nExisting tests. No behavior change.\n\nASV `MapPandasIterUDFTimeBench` comparison (`-a repeat\u003d3`, before \u003d `upstream/master`, after \u003d this branch):\n\n```text\nscenario             udf            before (ms)   after (ms)   diff (%)\nsm_batch_few_col     identity_udf       311            310         -0.3\nsm_batch_few_col     sort_udf           365            367         +0.5\nsm_batch_few_col     filter_udf         347            341         -1.7\nsm_batch_many_col    identity_udf       213            212         -0.5\nsm_batch_many_col    sort_udf           231            233         +0.9\nsm_batch_many_col    filter_udf         216            217         +0.5\nlg_batch_few_col     identity_udf       850            765        -10.0\nlg_batch_few_col     sort_udf          1030            965         -6.3\nlg_batch_few_col     filter_udf         815            804         -1.3\nlg_batch_many_col    identity_udf       791            787         -0.5\nlg_batch_many_col    sort_udf          1290           1280         -0.8\nlg_batch_many_col    filter_udf         889            811         -8.8\npure_ints            identity_udf       152            140         -7.9\npure_ints            sort_udf           168            166         -1.2\npure_ints            filter_udf         164            151         -7.9\npure_floats          identity_udf       158  
          139        -12.0\npure_floats          sort_udf           183            166         -9.3\npure_floats          filter_udf         179            161        -10.1\npure_strings         identity_udf       636            609         -4.2\npure_strings         sort_udf           894            799        -10.6\npure_strings         filter_udf         762            650        -14.7\npure_ts              identity_udf       267            211        -21.0\npure_ts              sort_udf           303            230        -24.1\npure_ts              filter_udf         273            233        -14.7\nmixed_types          identity_udf       571            435        -23.8\nmixed_types          sort_udf           668            518        -22.5\nmixed_types          filter_udf         507            460         -9.3\n```\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55750 from Yicong-Huang/SPARK-56758.\n\nAuthored-by: Yicong Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\nSigned-off-by: Yicong-Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\n"
    },
    {
      "commit": "522736869aae4da5a8a1c6605b15d7d6f51f0d6c",
      "tree": "0241aa3fdf3baf89b4d5f20401a348567708d01c",
      "parents": [
        "05b4635eb0d77677c70dc2bf008142c9abb9d4ee"
      ],
      "author": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Tue Jun 09 13:03:32 2026 -0700"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Tue Jun 09 13:03:32 2026 -0700"
      },
      "message": "[SPARK-57212][SQL][FOLLOWUP] Record AQE rule timing into the shared tracker via a lock instead of per-node trackers\n\n### What changes were proposed in this pull request?\n\nThis is a follow-up of #56275, addressing https://github.com/apache/spark/pull/56275#discussion_r3376088031.\n\nThe original PR routes physical preparation and AQE rules through `RuleExecutor` so their per-rule timing lands in `QueryPlanningTracker`. To avoid concurrent writes to a single tracker from the threads that plan scalar / IN / DPP subqueries, it gave each `AdaptiveSparkPlanExec` its own `QueryPlanningTracker`, registered them in `AdaptiveExecutionContext.planningTrackers`, and folded them all into the query\u0027s shared tracker (`context.qe.tracker`) in `finalPlanUpdate` via a new `QueryPlanningTracker.merge`.\n\nThis PR replaces that machinery with a lock on the single shared tracker. The only mutation point is `QueryPlanningTracker.recordRuleInvocation`; making it (and the `rules` / `topRulesByTime` read accessors) `synchronized` lets every AQE node -- main and subquery -- record straight into the single `context.qe.tracker`, which is already reachable from every node (sub-AQEs reuse the same `AdaptiveExecutionContext`).\n\nThat lets the following pieces, which existed solely to avoid concurrent writes, be deleted:\n- the per-`AdaptiveSparkPlanExec` `tracker` field,\n- `AdaptiveExecutionContext.planningTrackers`,\n- `QueryPlanningTracker.merge`,\n- the merge loop in `finalPlanUpdate`, and\n- the `if (!isSubquery)` special-case there.\n\n`PhysicalRuleExecutor`, the `applyPhysicalRules` signature change, and the `withTracker` wrapping from the original PR are unchanged.\n\n### Why are the changes needed?\n\nBeyond the reduced surface area, the lock makes correctness local:\n- As written, the merge is safe only because sub-AQEs complete (via `waitForSubqueries` -\u003e `awaitResult`) before `finalPlanUpdate` reads their trackers, and because `finalPlanUpdate` is a 
`lazy val` that runs once. That is a non-obvious chain. A lock makes the guarantee local to the one mutating method and independent of when subqueries complete or when `finalPlanUpdate` runs, so a future change to subquery scheduling or plan finalization cannot silently reintroduce a race.\n- Any future code that records a rule from another thread is automatically covered by the lock; with per-node trackers each new path would have to wire up its own tracker plus merge.\n- The cost is negligible: recording is one `HashMap` put per rule invocation, so an uncontended monitor on the single-threaded analyzer / optimizer path is tens of ns against ms-scale planning, and it is contended only by the rare concurrent subquery-planning threads.\n- A shared tracker reflects AQE rules as they run, rather than only after `finalPlanUpdate` at execution time.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. This is an internal refactor of how the same per-rule timing is recorded; `tracker.rules`, `tracker.topRulesByTime(...)`, and `RuleExecutor.dumpTimeSpent()` cover the same rules as after #56275.\n\n### How was this patch tested?\n\nExisting tests in `QueryPlanningTrackerEndToEndSuite` (added by #56275), including `SPARK-57212: Track sub-query AQE rules`, which exercises the cross-thread subquery-recording path, continue to pass.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Opus 4.8\n\nCloses #56410 from cloud-fan/SPARK-57212-tracker-lock-followup.\n\nAuthored-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "05b4635eb0d77677c70dc2bf008142c9abb9d4ee",
      "tree": "538ea64ae47314e8578e08a0de1d2aeda71fb290",
      "parents": [
        "029731cc0fb92123cc51c233f3112a32f4c05af8"
      ],
      "author": {
        "name": "Anupam Yadav",
        "email": "anupamya@amazon.com",
        "time": "Tue Jun 09 12:03:09 2026 -0700"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Tue Jun 09 12:03:09 2026 -0700"
      },
      "message": "[SPARK-57148][SQL] Rename splitSemiColonWithIndex to splitSemiColon\n\n### What changes were proposed in this pull request?\n\nRename `splitSemiColonWithIndex` to `splitSemiColon` in `StringUtils`. The method no longer returns indices (it returns `List[String]`), so the name is misleading.\n\nFollow-up to #55466 as suggested by cloud-fan.\n\n### Why are the changes needed?\n\nThe method was originally named `splitSemiColonWithIndex` when it returned index information. After the structural scanner refactor in SPARK-54876, it returns `List[String]` with no index information. The name should match the return type.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. This is an internal API rename (package-private method).\n\n### How was this patch tested?\n\nExisting tests pass. No behavioral change.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nYes.\n\nCloses #56386 from yadavay-amzn/fix/SPARK-57148-rename-splitSemiColon.\n\nAuthored-by: Anupam Yadav \u003canupamya@amazon.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "029731cc0fb92123cc51c233f3112a32f4c05af8",
      "tree": "1435708246bdd466dedf7aae6f2a3077b3c3f2e5",
      "parents": [
        "e94782ca65ebb6ae826fae734b380617a08f32db"
      ],
      "author": {
        "name": "Haiyang Sun",
        "email": "haiyang.sun@databricks.com",
        "time": "Tue Jun 09 12:00:27 2026 -0700"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Tue Jun 09 12:00:27 2026 -0700"
      },
      "message": "[SPARK-57349][CONNECT] Split udf protocol into message and grpc service.\n\n### What changes were proposed in this pull request?\n\nThis PR splits the UDF worker protocol definition into two `.proto` files in the `udf-worker-proto` module:\n\n- `udf_protocol.proto` is renamed to `udf_message.proto` and now contains **only** the protocol\u0027s message types (the `Execute`-stream envelopes, `Init`/`Finish`/`Cancel`/`Data` request and response messages, `ErrorResponse`, etc.).\n- A new `udf_service.proto` holds the gRPC **service** definition -- `service UdfWorker` with its `Execute(stream UdfRequest) returns (stream UdfResponse)` and `Manage(WorkerRequest) returns (WorkerResponse)` RPCs -- and imports `udf_message.proto`.\n\nThis is a pure reorganization of the protocol definitions. There is no change to message types, field numbers, or generated message classes, and no Scala/behavior change.\n\n### Why are the changes needed?\n\nClarity between message definition and grpc service.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. The protocol is part of the experimental, currently-unused UDF worker interface, and this PR only reorganizes the `.proto` source layout without altering the generated message classes.\n\n### How was this patch tested?\n\nExisting build/codegen coverage. Verified `build/sbt udf-worker-proto/compile` succeeds -- the proto module compiles all four `.proto` files (including the relocated service definition) into `protobuf-java` message classes with no gRPC dependency, confirming the split is well-formed.\n\n### Was this patch authored or co-authored using generative AI tooling?\nYes\n\nCloses #56391 from haiyangsun-db/pr1-udf-proto-split.\n\nAuthored-by: Haiyang Sun \u003chaiyang.sun@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "e94782ca65ebb6ae826fae734b380617a08f32db",
      "tree": "2a84de50ab0a54b1811959438f1e5e12bd2af295",
      "parents": [
        "098057d963495e91e5093371ee8a3543f709df55"
      ],
      "author": {
        "name": "pranavdev022",
        "email": "pranavdev022@gmail.com",
        "time": "Tue Jun 09 13:34:58 2026 -0400"
      },
      "committer": {
        "name": "Herman van Hövell",
        "email": "herman@databricks.com",
        "time": "Tue Jun 09 13:34:58 2026 -0400"
      },
      "message": "[SPARK-56538][CONNECT] Add per-RPC deadlines to Spark Connect client\n\n### What changes were proposed in this pull request?\n\nIntroduce a `RpcDeadlines` configuration class (Scala `case class`, Python `dataclass`) that assigns per-RPC gRPC deadlines to every Spark Connect client call. Each field controls the timeout for one RPC type and can be individually set to `None` to disable.\n\n**Defaults:**\n\n| RPC | Default |\n|-----|---------|\n| reattachableExecutePlan, reattachExecute | 10 minutes |\n| analyzePlan, addArtifacts | 1 hour |\n| config, interrupt, releaseSession, artifactStatus, cloneSession, getStatus, fetchErrorDetails | 10 minutes |\n| Non-reattachable ExecutePlan | None (no deadline) |\n\n### Why are the changes needed?\n\nThe Spark Connect client currently has no per-RPC timeouts. If a network connection silently dies (load balancer drops an idle connection, firewall closes a stale TCP socket, server becomes unreachable), the client hangs indefinitely with no error or feedback. This is particularly problematic for long-lived streaming responses on the reattachable execute path, where the client expects a continuous stream that may go silent without any TCP-level indication of failure.\n\nPer-RPC deadlines act as a last-resort kill mechanism: if no response arrives within the deadline window, gRPC raises `DEADLINE_EXCEEDED` on the client side. On the reattachable path, the client transparently opens a fresh `ReattachExecute` stream (the server-side operation continues running). On unary RPCs, the error surfaces to the user with a hint about how to adjust or disable deadlines.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes. All existing clients will get default deadlines on upgrade. 
Any call that previously hung indefinitely will now fail with `DEADLINE_EXCEEDED` after the configured timeout, accompanied by an error message explaining how to configure or disable deadlines via `RpcDeadlines`.\n\nUsers can:\n- Adjust individual deadlines: `SparkConnectClient.builder().rpcDeadlines(RpcDeadlines(analyzePlan \u003d Some(2.hours))).build()`\n- Disable all deadlines: `SparkConnectClient.builder().rpcDeadlines(RpcDeadlines.disabled).build()`\n- Python equivalent: `SparkConnectClient(url, rpc_deadlines\u003dRpcDeadlines(analyze_plan\u003d7200.0))` or `rpc_deadlines\u003dRpcDeadlines.disabled()`\n\n### How was this patch tested?\n\nAdded new tests to verify this feature in `SparkConnectClientSuite`, `SparkConnectClientRetriesSuite`, `test_client.py`, `test_client_retries.py`.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nYes. Co-authored with Claude Code (Anthropic).\n\nCloses #55402 from pranavdev022/SPARK-56538-per-rpc-deadlines.\n\nAuthored-by: pranavdev022 \u003cpranavdev022@gmail.com\u003e\nSigned-off-by: Herman van Hövell \u003cherman@databricks.com\u003e\n"
    },
    {
      "commit": "098057d963495e91e5093371ee8a3543f709df55",
      "tree": "22d07af065f677a09c388b68e147dd4be4e914e1",
      "parents": [
        "ac4457e4e5c54efacdcd3463f9284536a19f7427"
      ],
      "author": {
        "name": "Peter Toth",
        "email": "peter.toth@gmail.com",
        "time": "Tue Jun 09 15:25:59 2026 +0200"
      },
      "committer": {
        "name": "Peter Toth",
        "email": "peter.toth@gmail.com",
        "time": "Tue Jun 09 15:25:59 2026 +0200"
      },
      "message": "[SPARK-57212][SQL] Track preparation and AQE rule timing in `QueryPlanningTracker`\n\n### What changes were proposed in this pull request?\n\nPhysical preparation rules and AQE physical rules run through plain `foldLeft` loops that bypass `RuleExecutor.execute`, so their per-rule timing is never recorded by `QueryPlanningTracker` and therefore not reported by `topRulesByTime` or `RuleExecutor.dumpTimeSpent()`. This PR records that timing by routing those rules through `RuleExecutor`, the same machinery the analyzer and optimizer already use to track theirs.\n\n- Add `QueryExecution.PhysicalRuleExecutor`, a small `RuleExecutor[SparkPlan]` that runs a given rule list as a single `FixedPoint(1)` batch, and route `QueryExecution.prepareForExecution` and `AdaptiveSparkPlanExec.applyPhysicalRules` through it. `RuleExecutor.execute` already times each rule, calls `recordRuleInvocation` on the `QueryPlanningTracker` in scope, and runs `PlanChangeLogger`. `FixedPoint(1)` runs each rule exactly once and skips the idempotence re-check, since preparation rules are not necessarily idempotent.\n- Supply the tracker through `QueryPlanningTracker.withTracker` (a `ThreadLocal`): `QueryExecution` wraps the `PLANNING` phase, and each `AdaptiveSparkPlanExec` wraps `initialPlan` and the body of `withFinalPlanUpdate` with its own tracker. `reOptimize` calls `optimizer.execute`, which reads the same `ThreadLocal`.\n- Give each `AdaptiveSparkPlanExec` -- the main query plan and the plan built for every scalar / IN / DPP subquery -- its own `QueryPlanningTracker`, registered in `AdaptiveExecutionContext.planningTrackers`. Sub-AQEs plan on separate threads (`SubqueryExec` / `BroadcastExchangeExec` pools), but since each one writes only its own tracker, the per-rule recording path needs no synchronization. 
The main (non-subquery) node merges all of these trackers into the query\u0027s `QueryPlanningTracker` (`context.qe.tracker`) in `finalPlanUpdate`, on the driver thread once the final plan is ready and all subqueries have completed, so the shared tracker is only ever written from a single thread. `QueryPlanningTracker.merge` is added for this (its `RuleSummary` fields are additive).\n\nWith this, `QueryPlanningTracker.rules` and `topRulesByTime` cover physical preparation rules (`EnsureRequirements`, `CollapseCodegenStages`, `ReuseExchangeAndSubquery`, ...) and AQE rules of both the main query and its subqueries (`AdjustShuffleExchangePosition`, `ValidateSparkPlan`, `OptimizeSkewedJoin`, the AQE re-optimizer rules, ...), alongside the analyzer and optimizer rules they already track. These rules also appear in `RuleExecutor.dumpTimeSpent()` / `QueryExecutionMetering`, consistent with every other rule.\n\n`AdaptiveSparkPlanExec.optimizeQueryStage`\u0027s per-stage `queryStageOptimizerRules` (`CoalesceShufflePartitions`, `OptimizeShuffleWithLocalRead`, ...) are out of scope here: that loop has its own `AQEShuffleReadRule` validate/rollback handling and does not fit a plain `RuleExecutor`.\n\n### Why are the changes needed?\n\nA long-running preparation or AQE rule -- for example `EnsureRequirements` over a key-grouped join with many partitions, or AQE rules applied per stage or per subquery -- shows up only inside the `planning` phase total, never per rule, which makes it hard to diagnose where planning time goes. `QueryExecution.normalize` already records per-rule timing through `QueryPlanningTracker`; this brings the same per-rule visibility to physical preparation and AQE. 
Routing through `RuleExecutor` keeps a single source of truth for rule timing, and giving each plan its own tracker lets subqueries be covered without sharing a mutable `java.util.HashMap` across the threads that plan them.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo query-result or public-API change -- `tracker.rules`, `tracker.topRulesByTime(...)`, and `RuleExecutor.dumpTimeSpent()` simply gain entries for physical preparation and AQE rules.\n\nOne side effect is worth calling out, as it goes slightly beyond observability: `RuleExecutor.execute` runs plan-change validation, which the old `foldLeft` loops did not.\nWith `spark.sql.lightweightPlanChangeValidation` enabled by default, a lightweight validation now runs after each effective preparation/AQE rule -- a small, always-on cost on the planning path, and a new potential `PLAN_VALIDATION_FAILED_RULE_IN_BATCH` site should a rule ever produce an invalid plan. Preparation and AQE rules are expected to produce valid plans, so this is arguably a latent correctness improvement -- but it is more than pure observability.\n\n### How was this patch tested?\n\nNew tests in `QueryPlanningTrackerEndToEndSuite`:\n- `SPARK-57212: Track preparation rules` -- a non-AQE query asserts `EnsureRequirements` and `CollapseCodegenStages` appear in `tracker.rules`.\n- `SPARK-57212: Track AQE-internal preparation rules` -- a shuffle query asserts the AQE-only rules `AdjustShuffleExchangePosition` and `ValidateSparkPlan` appear.\n- `SPARK-57212: Track sub-query AQE rules` -- a query whose only shuffle is inside a scalar subquery asserts that `DynamicJoinSelection` (an AQE re-optimizer rule the shuffle-free main plan never runs) appears, confirming the subquery plan\u0027s tracker is merged into the\nquery tracker.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Opus 4.8\n\nCloses #56275 from peter-toth/SPARK-57212-track-preparation-rules.\n\nAuthored-by: Peter Toth 
\u003cpeter.toth@gmail.com\u003e\nSigned-off-by: Peter Toth \u003cpeter.toth@gmail.com\u003e\n"
    },
    {
      "commit": "ac4457e4e5c54efacdcd3463f9284536a19f7427",
      "tree": "dd395b6fa061f30c2e453a3efcae96ba2fb87f42",
      "parents": [
        "bf794730ee4b576f47e18d7f1ffa9589be41fb3a"
      ],
      "author": {
        "name": "Maxim Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Tue Jun 09 11:29:39 2026 +0200"
      },
      "committer": {
        "name": "Max Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Tue Jun 09 11:29:39 2026 +0200"
      },
      "message": "[SPARK-57259][SQL][TEST] Add nanosecond timestamp types to DataTypeTestUtils type sets\n\n### What changes were proposed in this pull request?\n\nAdd `TimestampLTZNanosType` and `TimestampNTZNanosType` (at min and max precision) to the shared type sets in `DataTypeTestUtils` (`ordered`, `atomicTypes`, and the derived `propertyCheckSupported` and `atomicArrayTypes`), mirroring the existing `timeTypes` pattern. Also refresh a stale comment in `OrderingSuite` now that the generic `atomicTypes` loop covers the nanosecond timestamp types.\n\n### Why are the changes needed?\n\n`TimestampLTZNanosType` and `TimestampNTZNanosType` extend `DatetimeType` -\u003e `AtomicType` but were absent from the `DataTypeTestUtils` type sets. Adding them triggers broader generic test coverage across `OrderingSuite`, `PredicateSuite`, `ConditionalExpressionSuite`, `ArithmeticExpressionSuite` (LEAST/GREATEST), `SortSuite`, `RandomDataGeneratorSuite`, and the Cast suites, exposing any gaps in the nanosecond timestamp type infrastructure.\n\n**This PR depends on SPARK-57317 (#56371)**, which fixes `Literal.create` for external nanosecond timestamp values; without it the new coverage in `PredicateSuite` (\"IN with different types\") fails. CI here is expected to be red until #56371 is merged and this branch is rebased.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. Test-only change.\n\n### How was this patch tested?\n\nRan the affected suites with this change on top of SPARK-57317: `OrderingSuite`, `RandomDataGeneratorSuite`, `PredicateSuite`, `ConditionalExpressionSuite`, `ArithmeticExpressionSuite`, `CastSuiteBase`, `CastWithAnsiOnSuite`, `CastWithAnsiOffSuite`, `SortSuite`, `UnsafeRowSuite`. 
All passed.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Cursor (Claude Opus 4.8)\n\nCloses #56372 from MaxGekk/spark-57259-test-sets.\n\nAuthored-by: Maxim Gekk \u003cmax.gekk@gmail.com\u003e\nSigned-off-by: Max Gekk \u003cmax.gekk@gmail.com\u003e\n"
    },
    {
      "commit": "bf794730ee4b576f47e18d7f1ffa9589be41fb3a",
      "tree": "7b80509cf5db58c4424b4eebcf8647ecfb4cf613",
      "parents": [
        "391d65a0e8f9f4db28fb08748096de52d0568500"
      ],
      "author": {
        "name": "Jiwon Park",
        "email": "jpark92@outlook.kr",
        "time": "Tue Jun 09 17:24:49 2026 +0800"
      },
      "committer": {
        "name": "Cheng Pan",
        "email": "chengpan@apache.org",
        "time": "Tue Jun 09 17:24:49 2026 +0800"
      },
      "message": "[SPARK-57274][CONNECT] Support fetch/type accessors and getMoreResults for SparkConnectStatement\n\n### What changes were proposed in this pull request?\n\nImplement the `SparkConnectStatement` accessors that previously threw `SQLFeatureNotSupportedException` (`setFetchSize`/`getFetchSize`,`setFetchDirection`/`getFetchDirection`, `getResultSetType`, `setQueryTimeout`/`getQueryTimeout`), fix `getMoreResults` (which threw, breaking JDBC result-drain loops), and implement the `createStatement` type/concurrency overloads on `SparkConnectConnection`. Follow-up to SPARK-54108 / SPARK-54014.\n\nFetch size / query timeout are stored as hints (Spark Connect is forward-only and server-paginated); `createStatement` accepts forward-only / scroll-insensitive and rejects updatable / scroll-sensitive, mirroring the Spark Thrift Server\u0027s Hive JDBC policy. See SPARK-57274 for details.\n\n### Why are the changes needed?\n\nJDBC client tools (e.g. DataGrip) call these methods around every query, so the throws abort the query path. `getMoreResults` throwing in particular makes the standard drain loop `while (getMoreResults() || getUpdateCount() !\u003d -1)` error out or spin forever; DataGrip hangs on a result-less command such as `USE \u003cdb\u003e`.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes, in the unreleased master only: methods that previously threw `SQLFeatureNotSupportedException` are now implemented. No existing behavior is removed.\n\n### How was this patch tested?\n\nNew cases in `SparkConnectStatementSuite`: accessor defaults/validation, drain-loop termination, and the typed `createStatement` overloads.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Opus 4.8)\n\nCloses #56341 from j1wonpark/SPARK-57274.\n\nAuthored-by: Jiwon Park \u003cjpark92@outlook.kr\u003e\nSigned-off-by: Cheng Pan \u003cchengpan@apache.org\u003e\n"
    },
    {
      "commit": "391d65a0e8f9f4db28fb08748096de52d0568500",
      "tree": "220d479843a068d283142609e42969e1bc3ae34e",
      "parents": [
        "13ea0f51af365c147caf8cea628584f86685110e"
      ],
      "author": {
        "name": "Boyang Jerry Peng",
        "email": "jerry.peng@databricks.com",
        "time": "Tue Jun 09 00:14:51 2026 -0700"
      },
      "committer": {
        "name": "Liang-Chi Hsieh",
        "email": "viirya@gmail.com",
        "time": "Tue Jun 09 00:14:51 2026 -0700"
      },
      "message": "[SPARK-57234][SS][DOCS] Add Real-time Mode documentation page to the Structured Streaming guide\n\n### What changes were proposed in this pull request?\n\n  This PR adds a new documentation page for **Real-time Mode** in Structured Streaming, introduced in Spark 4.1.0 (SPARK-53736): `docs/streaming/real-time-mode.md`. The page covers:\n\n  - **How Real-time Mode works**: long-running tasks (one per input partition) that process records continuously, in contrast to the per-micro-batch task scheduling of the default engine.\n  - **Batch duration is a checkpoint interval, not a latency target**.\n  - A **comparison** with the other execution modes.\n  - **Enabling Real-time Mode**: the `Trigger.RealTime(...)` API (Scala/Java) and the `realTime` trigger keyword argument (Python), plus the requirements to start (update output mode,\n  checkpoint location, minimum batch duration).\n  - **Supported queries** (stateless only), **fault tolerance** (exactly-once processing semantics; sinks such as Kafka provide at-least-once delivery), **configuration**, **examples**\n  (Python/Scala/Java), and **caveats**.\n\n  It also registers the new page in the Structured Streaming left navigation (`docs/_data/menu-streaming.yaml`).\n\n Real-time Mode (stateless) was added in Spark 4.1.0 but has no user-facing documentation in the Structured Streaming programming guide. This PR adds that page. See SPARK-57234.\n\n  ### Does this PR introduce _any_ user-facing change?\n\n  No. This is a documentation-only change.\n\n  ### How was this patch tested?\n\n  Documentation-only change. The new page was validated for structure (front matter, code tabs, Liquid `{% highlight %}` tags), internal links and in-page anchors, navigation-menu anchors,\n  and ASCII-only content. 
All trigger API signatures, configuration keys and defaults, the supported-operator and sink lists, and error-class references were cross-checked against the Spark\n  4.1.0 source (`Trigger.java`, `Triggers.scala`, `RealTimeModeAllowlist.scala`, `SQLConf.scala`, `KafkaMicroBatchStream.scala`, and `error-conditions.json`). Reviewers can verify rendering\n  locally with `SKIP_API\u003d1 bundle exec jekyll build` from the `docs/` directory.\n\n  ### Was this patch authored or co-authored using generative AI tooling?\n\n co-authored with Claude Code (Opus 4.8).\n\nCloses #56314 from jerrypeng/real-time-mode-docs.\n\nAuthored-by: Boyang Jerry Peng \u003cjerry.peng@databricks.com\u003e\nSigned-off-by: Liang-Chi Hsieh \u003cviirya@gmail.com\u003e\n"
    },
    {
      "commit": "13ea0f51af365c147caf8cea628584f86685110e",
      "tree": "c2e5aa447aa50743910da7b0294222929645d3dd",
      "parents": [
        "3e7cae796325fc6b251022f6b695dc15100aa6f1"
      ],
      "author": {
        "name": "Boyang Jerry Peng",
        "email": "jerry.peng@databricks.com",
        "time": "Tue Jun 09 00:11:12 2026 -0700"
      },
      "committer": {
        "name": "Liang-Chi Hsieh",
        "email": "viirya@gmail.com",
        "time": "Tue Jun 09 00:11:12 2026 -0700"
      },
      "message": "[SPARK-57281][SQL][SS] Remove @Experimental annotation from Real-time mode\n\n### What changes were proposed in this pull request?\n\nRemove Experimental since the RTM APIs as they are no longer experimental and to make them consistent with other triggers supported in structured streaming.\n\n### Why are the changes needed?\n\nRTM APIs are no longer experimental.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo\n\n### How was this patch tested?\n\nn/a\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nCo-authored with Claude\n\nCloses #56346 from jerrypeng/rtm-remove-experimental.\n\nAuthored-by: Boyang Jerry Peng \u003cjerry.peng@databricks.com\u003e\nSigned-off-by: Liang-Chi Hsieh \u003cviirya@gmail.com\u003e\n"
    },
    {
      "commit": "3e7cae796325fc6b251022f6b695dc15100aa6f1",
      "tree": "88e741b439177d011d6e7bdd2d235f7fcd791979",
      "parents": [
        "b67073f19ecd5a89193f96926cd832bb86a50382"
      ],
      "author": {
        "name": "Rishav Sinha",
        "email": "sinharishav31@gmail.com",
        "time": "Mon Jun 08 20:56:40 2026 -0700"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Mon Jun 08 20:56:40 2026 -0700"
      },
      "message": "[SPARK-50520][PYTHON] Respect timeout in df.rdd.countApprox()\n\n### What changes were proposed in this pull request?\nPySpark approximate RDD actions currently call getFinalValue() on the PartialResult returned by Spark approximate job APIs. This introduces blocking behavior and causes APIs such as countApprox(timeout\u003d...), sumApprox(timeout\u003d...), and meanApprox(timeout\u003d...) to wait for full job completion instead of respecting timeout semantics. This PR updates PySpark to use PartialResult.initialValue(), which contains the timeout-aware approximation produced by ApproximateActionListener.awaitResult(). As a result, approximate RDD actions now return the partial result available at the specified timeout while still returning exact results when the computation completes within the timeout.\nAdditionally, regression tests were added and strengthened to validate:\n- timeout-aware approximate behavior for both countApprox and meanApprox using deliberately slow workloads,\n- exact results when computation completes successfully,\n- cleanup of background approximate jobs to avoid interference with subsequent tests.\n\n### Why are the changes needed?\nSpark approximate actions are designed to return partial results after the specified timeout. Scala APIs already expose this behavior through PartialResult, but PySpark currently forces blocking completion by calling getFinalValue(). As a result, PySpark approximate actions ignore timeout semantics and wait for the entire job to finish, which is inconsistent with the intended behavior of Spark\u0027s approximate execution APIs.\n\n### Does this PR introduce _any_ user-facing change?\nYes. 
PySpark approximate RDD actions (countApprox, sumApprox, and meanApprox) now correctly respect timeout semantics and may return timeout-aware approximate results instead of blocking until full job completion.\n\n### How was this patch tested?\n- Reproduced the issue locally using a slow-running RDD workload.\n- Verified timeout behavior before and after the fix.\n- Verified exact results are still returned when the computation completes within the timeout.\n- Added regression tests in python/pyspark/tests/test_rdd.py.\n- Ran: python/run-tests.py --testnames pyspark.tests.test_rdd\n- Manually validated timeout and completed-result behavior for countApprox, sumApprox, and meanApprox.\n\n### Was this patch authored or co-authored using generative AI tooling?\nNo\n\nCloses #56060 from rishav23/fix-spark-50520-countapprox-timeout.\n\nAuthored-by: Rishav Sinha \u003csinharishav31@gmail.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "b67073f19ecd5a89193f96926cd832bb86a50382",
      "tree": "5b7342671b3fdeaa17d53a6845e60edf4795b226",
      "parents": [
        "cc64f0a3d84c9958e7b17fadafcf34d0612294a1"
      ],
      "author": {
        "name": "Kousuke Saruta",
        "email": "sarutak@apache.org",
        "time": "Mon Jun 08 20:53:44 2026 -0700"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Mon Jun 08 20:53:44 2026 -0700"
      },
      "message": "[SPARK-57253][SQL] Add `jaro_winkler_similarity` built-in function\n\n### What changes were proposed in this pull request?\nThis PR adds a new built-in string function `jaro_winkler_similarity(str1, str2)` that computes the Jaro-Winkler similarit\ny between two strings. The result is a double between 0.0 (no similarity) and 1.0 (identical strings).\n\nThe Jaro-Winkler metric is an extension of the Jaro similarity that gives a bonus for common prefixes (up to 4 characters)\n, making it especially suited for short strings such as names.\n\nChanges:\n* `ExpressionImplUtils.jaroWinklerSimilarity()` — core algorithm implementation\n* `JaroWinkler` expression as `RuntimeReplaceable` + `StaticInvoke`\n* Registration as `jaro_winkler_similarity` in `FunctionRegistry`\n* Scala DataFrame API in `functions.scala`\n* PySpark API in `pyspark.sql.functions` + Spark Connect\n* Unit tests covering basic cases, symmetry, multi-byte strings, and null handling\n\n### Why are the changes needed?\nJaro-Winkler is one of the most commonly used string similarity metrics for record linkage, deduplication, and fuzzy matching. It is available as a built-in function in DuckDB, SQL Server 2025, and various other engines, but Spark users currently need to implement it as a UDF.\n\nA built-in function provides:\n- Codegen support (no UDF overhead in whole-stage codegen pipelines)\n- Consistency with the existing `levenshtein` function\n- Better discoverability for SQL users\n\n### Does this PR introduce _any_ user-facing change?\nYes. A new SQL function `jaro_winkler_similarity` is added.\n\n### How was this patch tested?\nAdded new tests.\n\n### Was this patch authored or co-authored using generative AI tooling?\nKiro CLI / Claude\n\nCloses #56310 from sarutak/jaro-winkler.\n\nAuthored-by: Kousuke Saruta \u003csarutak@apache.org\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "cc64f0a3d84c9958e7b17fadafcf34d0612294a1",
      "tree": "7d0455b5a913e6317c7d817269d3b5c32539ff84",
      "parents": [
        "3744250bc7bae17f29818831e0037035041a90a6"
      ],
      "author": {
        "name": "Anupam Yadav",
        "email": "anupamya@amazon.com",
        "time": "Mon Jun 08 20:51:19 2026 -0700"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Mon Jun 08 20:51:19 2026 -0700"
      },
      "message": "[SPARK-52719][SQL] Support using scalar UDFs in TVF arguments\n\n### What changes were proposed in this pull request?\n\nAdd a guard in `ResolveSQLFunctions` (Analyzer.scala) that allows `SQLFunctionExpression` nodes inside `SQLTableFunction` inputs to pass through for downstream resolution, instead of being eagerly rejected by the catch-all case.\n\n### Why are the changes needed?\n\n`SELECT * FROM tvf(scalar_udf(true))` incorrectly throws `AnalysisException: Using SQL function ... in SQLTableFunction is not supported`. Scalar UDFs return scalar values and should be valid TVF arguments. The catch-all case in `ResolveSQLFunctions` was firing on `SQLTableFunction` before its inputs were resolved, incorrectly treating scalar UDFs as unsupported.\n\nNote: `SQLFunctionExpression` is exclusively for scalar functions -- TVFs use `SQLTableFunction` (a LogicalPlan node). The guard cannot accidentally permit nested TVFs because `SessionCatalog.makeSQLFunctionBuilder` throws `NOT_A_SCALAR_FUNCTION` if a TVF name is used in a scalar position.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes -- scalar SQL UDFs can now be used as arguments to table-valued functions.\n\n### How was this patch tested?\n\nAdded 5 tests in `SQLFunctionSuite`:\n1. **Happy path**: `tvf(scalar_udf(1))` returns correct result\n2. **Negative test**: TVF-in-TVF argument is still rejected with `NOT_A_SCALAR_FUNCTION`\n3. **Nested scalar UDFs**: `tvf(add_one(add_one(1)))` resolves recursively\n4. **Mixed args**: `multi_tvf(scalar_udf(3), 10)` with UDF + literal\n5. **NULL propagation**: `tvf(null_udf(1))` correctly returns null\n\nWithout the fix, test 1 throws `AnalysisException`. With the fix, all 5 pass.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nYes.\n\nCloses #55950 from yadavay-amzn/fix/SPARK-52719-udf-in-tvf.\n\nAuthored-by: Anupam Yadav \u003canupamya@amazon.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "3744250bc7bae17f29818831e0037035041a90a6",
      "tree": "6916df79456e85bdd900abf119806254fa3d128c",
      "parents": [
        "ac0c117cf280e1a44e8fdc0c00a91740bb9a6abf"
      ],
      "author": {
        "name": "Anupam Yadav",
        "email": "anupamya@amazon.com",
        "time": "Mon Jun 08 19:58:42 2026 -0700"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Mon Jun 08 19:58:42 2026 -0700"
      },
      "message": "[SPARK-54876][SQL] Fix splitSemiColon dropping statement ending with block comment\n\n### What changes were proposed in this pull request?\n\nReplace the flag-based `splitSemiColon` implementation with a structural SQL-aware scanner (per cloud-fan\u0027s review). The scanner uses three `consumeXxx` helpers (string/line-comment/block-comment) that make the interior of strings and comments opaque to the outer loop.\n\n`SparkSQLCLIDriver.splitSemiColon` now delegates to `StringUtils.splitSemiColonWithIndex` (single implementation, no duplication).\n\n### Why are the changes needed?\n\nThe original bug: `splitSemiColon` drops the last SQL statement when it ends with a block comment. The incremental fix approach (adding ad-hoc string surgery) was fragile -- each patch introduced the next edge case. The structural scanner fixes all known cases by construction:\n- Trailing block comment after last statement\n- Nested block comments (`/* outer /* inner */ */`)\n- Line comments preceding block comments\n- Semicolons inside string literals and backtick-quoted identifiers\n\n### Behavior change note\n\n`SparkSQLCLIDriver.splitSemiColon` now treats backtick-quoted identifiers (`` ` ``) as string delimiters, matching ANSI identifier quoting and the existing `StringUtils.splitSemiColonWithIndex` behavior. Semicolons inside backtick-quoted identifiers no longer split.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes -- SQL statements ending with block comments are no longer silently dropped. 
Backtick-quoted identifiers containing semicolons are now handled correctly.\n\n### How was this patch tested?\n\n- All existing `StringUtilsSuite` tests pass (12 tests)\n- All existing `CliSuite` SPARK-37906 tests pass\n- New tests for: trailing block comment, nested block comments, line comment + block comment, backtick-quoted identifiers\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nYes.\n\nCloses #55466 from yadavay-amzn/fix/SPARK-54876-split-semicolon.\n\nAuthored-by: Anupam Yadav \u003canupamya@amazon.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "ac0c117cf280e1a44e8fdc0c00a91740bb9a6abf",
      "tree": "33e24b6eb4a90e0fd028b5958a7194d007e1858f",
      "parents": [
        "0993d4345969dfe16b334598dc80a452e4a270f7"
      ],
      "author": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Tue Jun 09 09:40:15 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@foxmail.com",
        "time": "Tue Jun 09 09:40:15 2026 +0800"
      },
      "message": "[SPARK-57330][INFRA] Switch shared CI compile artifacts to zstd compression\n\n### What changes were proposed in this pull request?\n\nCompress the shared compile artifacts that CI passes between jobs with `zstd` instead of `gzip`, across the three reusable workflows that produce/consume them:\n\n- `build_and_test.yml` - `compile-artifact` (produced by `precompile`, consumed by the `build`, `pyspark`, `sparkr`, `tpcds-1g`, `docker-integration-tests` and `k8s-integration-tests` jobs)\n- `python_hosted_runner_test.yml` - `compile-artifact` (macOS / ARM python matrix)\n- `maven_test.yml` - `compile-target` and `compile-m2-spark`\n\nEach producer pipes `tar` through the `zstd` binary, and each consumer decompresses with `zstd -dc`:\n\n```bash\n# create\n... | tar --null -cf - -T - | zstd -c -T0 \u003e compile-artifact.tar.zst\n# extract\nzstd -dc compile-artifact.tar.zst | tar -xf -\n```\n\n`zstd` is driven through the standalone binary rather than `tar --zstd` on purpose: GitHub\u0027s macOS runners ship bsdtar, whose `--zstd` hangs, so letting `tar` do plain (un)archiving and piping through the `zstd` binary keeps one portable idiom across Linux hosts, the container images, `ubuntu-24.04-arm` and macOS.\n\n### Why are the changes needed?\n\n`zstd` compresses and (especially) decompresses much faster than `gzip` at a comparable or better ratio, and `-T0` parallelizes compression across all cores. These artifacts are produced once and downloaded by up to 8 downstream jobs per run, so faster (de)compression shortens the critical path of every matrix entry. The `zstd` binary is already present on all relevant runners and container images (the latter since SPARK-57278), so no new dependency is introduced.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. 
CI-only.\n\n### How was this patch tested?\n\n- Local validation: all three workflows parse as YAML, and no `*.tar.gz` reference to these artifacts remains.\n- The artifacts are ephemeral per run (`retention-days: 1`, names keyed by `run_id`), so producer and consumer always run the same code; there is no cross-version compatibility concern.\n- The `build_and_test.yml` path is exercised by this PR\u0027s CI. The `maven_test.yml` and `python_hosted_runner_test.yml` paths run via their scheduled / `workflow_dispatch` callers and can be validated by dispatching those workflows on the fork.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Opus 4.8\n\nCloses #56369 from zhengruifeng/shared-artifact-zstd-dev6.\n\nAuthored-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@foxmail.com\u003e\n"
    },
    {
      "commit": "0993d4345969dfe16b334598dc80a452e4a270f7",
      "tree": "9ddd06eb03b9143e8f8e12148eed244e23f296e1",
      "parents": [
        "3c31d68fc6aa1353f3f3e998246499e21e8092be"
      ],
      "author": {
        "name": "tonghuaroot (童话)",
        "email": "tonghuaroot@gmail.com",
        "time": "Tue Jun 09 08:54:05 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@foxmail.com",
        "time": "Tue Jun 09 08:54:05 2026 +0800"
      },
      "message": "[SPARK-57314][PS][TEST] Add tests for Index.equals in pandas-on-Spark\n\n### What changes were proposed in this pull request?\n\nThis PR adds positive parity tests for `Index.equals` in pandas-on-Spark\n(`IndexBasicMixin` in `python/pyspark/pandas/tests/indexes/test_basic.py`).\nThe tests compare the `True`/`False` result of `Index.equals` against pandas\nfor single `Index` and `MultiIndex`: equal and unequal element sets,\nreordered elements, and the fact that `equals` ignores the index name.\n\n### Why are the changes needed?\n\n`Index.equals` currently has no positive test coverage. The only existing\ntest exercises the error path when `compute.ops_on_diff_frames` is disabled\nand never checks the actual `True`/`False` result, so the common behavior is\nuntested. These tests close that gap.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. This PR only adds tests.\n\n### How was this patch tested?\n\nRan the new test against a local SparkSession\n(`IndexBasicTests.test_equals`); it passes.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nYes, this patch was co-authored with generative AI tooling (Claude,\nAnthropic Opus 4.8). The contributor selected the under-tested method,\nrequired that the asserted cases first be verified to match pandas on a\nreal SparkSession (excluding inputs where pandas-on-Spark intentionally\ndiffers, e.g. NaN comparison under Spark equality), and reviewed the result.\n\nCloses #56362 from tonghuaroot/ps-tests-identical-equals.\n\nAuthored-by: tonghuaroot (童话) \u003ctonghuaroot@gmail.com\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@foxmail.com\u003e\n"
    },
    {
      "commit": "3c31d68fc6aa1353f3f3e998246499e21e8092be",
      "tree": "a726f19be55c355fc79f46f8c5ac4344e963a8df",
      "parents": [
        "49908a2cf92870bb15381b2cc80da51b51b491ed"
      ],
      "author": {
        "name": "Shrirang Mhalgi",
        "email": "shrirangmhalgi@gmail.com",
        "time": "Mon Jun 08 17:14:45 2026 -0700"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Mon Jun 08 17:14:45 2026 -0700"
      },
      "message": "[SPARK-57287][SQL] Escape backslash in LIKE pattern for STARTS_WITH/ENDS_WITH/CONTAINS pushdown\n\n### What changes were proposed in this pull request?\nEscape the backslash character in `V2ExpressionSQLBuilder.escapeSpecialCharsForLikePattern` so that `STARTS_WITH`, `ENDS_WITH`, and `CONTAINS` predicates containing backslashes produce correct `LIKE` patterns when pushed down to V2 data sources.\n\n### Why are the changes needed?\nThe LIKE patterns generated for predicate pushdown declare `ESCAPE \u0027\\\u0027`, making backslash the escape character. However, `escapeSpecialCharsForLikePattern` only escapes `_` and `%` - it does not escape the backslash itself. When a filter value contains a backslash (e.g., `startsWith(\"abc\\\")`), the pushed SQL becomes LIKE `\u0027abc\\%\u0027` where `\\%` is interpreted as \"literal %\" rather than \"literal backslash followed by wildcard %\". This produces silently wrong results on the remote database.\n\n### Does this PR introduce _any_ user-facing change?\nYes. V2 data source queries with `startsWith`, `endsWith`, or `contains` predicates on values containing backslashes now return correct results instead of silently wrong results.\n\n### How was this patch tested?\nAdded regression tests in `JDBCV2Suite` covering `startsWith`, `endsWith`, and `contains` with backslash values against the H2 test database. Tests fail without the fix and pass with it.\n\n### Was this patch authored or co-authored using generative AI tooling?\nYes. Authored using Claude Opus 4.6\n\nCloses #56350 from shrirangmhalgi/backslash-issue-fix.\n\nAuthored-by: Shrirang Mhalgi \u003cshrirangmhalgi@gmail.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "49908a2cf92870bb15381b2cc80da51b51b491ed",
      "tree": "8085b6542f089886e1e428bad07c57c3d7747c0d",
      "parents": [
        "0849776499c9d223ee6fa5a2dd53b999f6f39818"
      ],
      "author": {
        "name": "Eric Yang",
        "email": "jiwen624@gmail.com",
        "time": "Mon Jun 08 17:13:33 2026 -0700"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Mon Jun 08 17:13:33 2026 -0700"
      },
      "message": "[SPARK-57298][SQL] collect_set fails to dedupe float/double NaN/-0.0 by their semantics\n\n### What changes were proposed in this pull request?\n\n`CollectSet` now normalizes special floating-point values (NaN and `-0.0`) before inserting them into its deduplication buffer, so it follows Spark\u0027s float/double equality semantics (all NaNs are equal; `-0.0` equals `0.0`). This covers both scalar and nested (struct/array) `FLOAT`/`DOUBLE` columns:\n\n- **Top-level `DOUBLE`/`FLOAT`:** the buffer element type becomes `LONG`/`INT` and values are stored as their normalized bit pattern. `convertToBufferElement` reuses `NormalizeFloatingNumbers.DOUBLE_NORMALIZER`/`FLOAT_NORMALIZER` (NaN -\u003e canonical, `-0.0` -\u003e `0.0`) and then `doubleToLongBits`/`floatToIntBits`; `eval` converts the bits back. Keying on bits is required because the `HashSet` buffer compares boxed numbers with primitive equality, where `NaN !\u003d NaN`. Normalizing `-0.0` is necessary here so that `-0.0`/`0.0`, which deduplicate today, keep deduplicating once the buffer keys on the bit pattern (otherwise `doubleToLongBits(-0.0) !\u003d doubleToLongBits(0.0)` would regress it).\n- **Complex types recursively containing `FLOAT`/`DOUBLE` (struct/array):** values are normalized with `NormalizeFloatingNumbers.normalize` and materialized as `UnsafeRow`/`UnsafeArrayData`, so the buffer deduplicates on the canonical binary representation. 
`MapType` is unaffected (already rejected by `checkInputDataTypes`).\n\nThis reuses the normalization logic Spark already applies to other hash-based array set operation (`NormalizeFloatingNumbers`, case 5).\n\n### Why are the changes needed?\n\n`collect_set` over `FLOAT`/`DOUBLE` did not follow Spark\u0027s floating-point equality semantics and returned elements that should be considered equal:\n\n```sql\n-- Top-level NaN:\nSELECT collect_set(v) FROM VALUES (double(\u0027NaN\u0027)), (double(\u0027NaN\u0027)) AS t(v);\n-- before: [NaN, NaN]   after: [NaN]\n\n-- Nested -0.0 / 0.0:\nSELECT collect_set(a) FROM VALUES (array(-0.0D)), (array(0.0D)) AS t(a);\n-- before: [[-0.0], [0.0]]   after: [[0.0]]\nSELECT collect_set(named_struct(\u0027a\u0027, v)) FROM VALUES (-0.0D), (0.0D) AS t(v);\n-- before: [{a:-0.0}, {a:0.0}]   after: [{a:0.0}]\n```\n\n(Top-level `-0.0`/`0.0` already deduplicate today; this PR preserves that while fixing the cases above.)\n\n### Does this PR introduce _any_ user-facing change?\n\nYes. For `collect_set` over `FLOAT`/`DOUBLE` columns, including struct/array columns that contain them:\n- duplicate `NaN` values are no longer returned, and\n- `-0.0` and `0.0` are deduplicated even when nested in a struct/array.\n\nIn addition, a `-0.0` element is now always returned as `0.0`. `collect_set` is already documented as non-deterministic, and `-0.0`/`0.0` were already collapsed at the top level, so this only affects the returned representation.\n\n### How was this patch tested?\n\nNew test cases.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nYes. Claude Code\n\nCloses #56360 from jiwen624/collect-set-nan-dedup.\n\nAuthored-by: Eric Yang \u003cjiwen624@gmail.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "0849776499c9d223ee6fa5a2dd53b999f6f39818",
      "tree": "1cc391993a9877a0050b1ca9778b4bee24b614d6",
      "parents": [
        "bee16dc7143541ba52d16e31cfd1e5913c38f96c"
      ],
      "author": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Mon Jun 08 16:30:11 2026 -0700"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Mon Jun 08 16:30:11 2026 -0700"
      },
      "message": "[SPARK-57326][SQL][TEST] Honor DEFAULT_ARTIFACT_REPOSITORY in IsolatedClientLoaderIvySettingsSuite\n\n### What changes were proposed in this pull request?\n\n`IsolatedClientLoaderIvySettingsSuite` writes an `ivysettings.xml` whose resolver root is hardcoded to `https://repo1.maven.org/maven2/`. This PR reads the root from the `DEFAULT_ARTIFACT_REPOSITORY` environment variable instead, falling back to `https://repo1.maven.org/maven2/` when it is unset — the same way `MavenUtils.createRepoResolvers` already picks the default `central` resolver root.\n\n### Why are the changes needed?\n\nThe test downloads the full Hive 2.3 dependency closure over the network and it exercises `MavenUtils`, which already honors `DEFAULT_ARTIFACT_REPOSITORY` to redirect Ivy resolution. By hardcoding Maven Central, the test diverges from the very code path it covers: in an environment that sets `DEFAULT_ARTIFACT_REPOSITORY`, real Hive-jar downloads go through the configured repository, but this test still forces direct Maven Central access.\n\nThis is a problem in CI that proxies or mirrors Maven Central and may block or rate-limit direct access to `repo1.maven.org`, where resolution then stalls or times out. Spark already accounts for Maven Central being flaky in CI; see the `gcs-maven-central-mirror` resolver in `project/SparkBuild.scala`, placed first \"so that it\u0027s used instead of flaky Maven Central.\" Letting the test honor `DEFAULT_ARTIFACT_REPOSITORY` makes it consistent with that code and lets it run reliably wherever a mirror is configured.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. Test-only change. Default behavior (variable unset) is unchanged.\n\n### How was this patch tested?\n\nExisting test (`IsolatedClientLoaderIvySettingsSuite`). 
Verified that resolution uses the configured repository when `DEFAULT_ARTIFACT_REPOSITORY` is set, and falls back to Maven Central when it is unset.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Opus 4.8)\n\nCloses #56376 from cloud-fan/wenchen/minor-ivysettings-default-artifact-repo.\n\nAuthored-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "bee16dc7143541ba52d16e31cfd1e5913c38f96c",
      "tree": "f1e49f802d54b39a3e10e5ef34193946294b6ae8",
      "parents": [
        "3ebf8d6b519767fbd55227080cffb23e14810e3d"
      ],
      "author": {
        "name": "akshatshenoi-db",
        "email": "akshat.shenoi@databricks.com",
        "time": "Mon Jun 08 16:15:15 2026 -0700"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Mon Jun 08 16:15:15 2026 -0700"
      },
      "message": "[SPARK-57135][SQL] Support reading CSV files inside tar archives\n\n### What changes were proposed in this pull request?\n\nAdds support for reading CSV files packaged in tar archives (`.tar`, `.tar.gz`, `.tgz`) directly through the CSV data source, by **streaming** each archive entry through the CSV parser without unpacking it to disk. Gated behind a new config `spark.sql.files.archive.reader.enabled` (default `false`).\n\n- **`ArchiveReader`** (new): a small streaming core. `ArchiveReader(path)` selects an implementation by file extension, and `readEntries(conf)(parseEntry)` opens the archive once, hands each non-skipped entry to `parseEntry` as a bounded, non-closing `InputStream`, and concatenates the per-entry results into a single iterator. It advances to the next entry only after the current one is fully consumed, so at most one entry is in flight and memory stays bounded regardless of archive size. Directories and entries that Spark\u0027s own file listing filters out -- dot- and underscore-prefixed names such as `._*`, `.DS_Store`, `_SUCCESS`, `_committed_*` (via `HadoopFSUtils.shouldFilterOutPathName`) -- are skipped, so an archive parses like a directory of the same files; the stream is closed on exhaustion, on `close()`, and (defensively) on task completion. `ArchiveReader` is an abstract base; `TarArchiveReader` is the only implementation today. `.tar.gz` is auto-decompressed by Hadoop\u0027s codec factory; `.tgz` (not a registered codec extension) is unwrapped with `GZIPInputStream`.\n- **`CSVFileFormat`**: archives are non-splittable (`isSplitable` returns `false`), so each archive is read as a single split; `buildReader` streams every entry through `UnivocityParser` (`parseStream` for `multiLine`, otherwise `parseIterator` over a `LineReader`-backed line iterator). 
Each entry is treated as the start of its own file, so headers are validated and dropped per entry, exactly as for standalone CSV files.\n- **`CSVDataSource`**: a `readArchive` path streams entries through the same per-entry parser / header-checker construction used for a standalone CSV read. It lives on the V1 `CSVFileFormat` read path only; the V2 file data source calls `readFile` directly and is intentionally left untouched.\n\nThe streaming approach avoids local disk entirely; the trade-off is that it only supports formats parseable from a sequential stream, so this PR scopes the feature to CSV over tar. Formats that need random access within a file (Parquet/ORC footers) cannot stream from a tar and are out of scope.\n\nTwo scope notes:\n- **Schema inference is not yet supported for archives** -- an explicit schema is required. Reading an archive without one raises Spark\u0027s standard `UNABLE_TO_INFER_SCHEMA` error (\"Unable to infer schema for CSV. It must be specified manually.\"). Inferring a schema by streaming archive entries is a planned follow-up.\n- `ignoreCorruptFiles`/`ignoreMissingFiles` are **archive-granular**: because a tar is opened as a single non-splittable stream, a corrupt or missing archive is skipped whole -- unlike a directory of loose files, where only the bad file is skipped and the rest are kept.\n\nThe `ArchiveReader` abstraction -- extension-dispatched `apply`, one subclass per archive format, and a format-agnostic `lineIterator` -- is a deliberate seam: other file formats (e.g. JSON, text, XML) and other archive formats are intended to be added later as additive subclasses/bindings, without reworking this core.\n\nThis change was reviewed by Alden Lau on the ingestion core team.\n\n### Why are the changes needed?\n\nA common ingestion pattern packs many small CSV files into tar archives to reduce file/namespace pressure on object stores and HDFS. Today these cannot be read without unpacking them externally first. 
This lets users point the CSV reader directly at a tar archive. Streaming (rather than materializing entries to local disk) keeps the read bounded in memory and adds no local-disk requirement.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes. A new config `spark.sql.files.archive.reader.enabled` (default `false`) is added. When enabled, the CSV data source reads `.tar`/`.tar.gz`/`.tgz` paths by streaming their entries during a scan; an explicit schema is required, since schema inference for archives is not yet supported. With the default `false`, behavior is unchanged.\n\n### How was this patch tested?\n\nNew tests:\n- `ArchiveReaderSuite` (unit): `isArchivePath` dispatch and `readEntries` -- entry ordering, gzip handling (`.tar.gz` and `.tgz`), directory/dotfile skipping, lazy one-entry-at-a-time advance, the non-closing entry stream, idempotent `close()`, and `TaskContext` cleanup.\n- End-to-end CSV reads of `.tar`/`.tar.gz`/`.tgz` through the data source, asserting parity with reading the same entries as loose files in a directory. The format- and archive-agnostic harness (`ArchiveReadSuiteBase` + `TarArchiveReadBase`) is bound to CSV by `CSVArchiveReadBase`, split into header (`CSVHeaderTarArchiveReadSuite`) and headerless (`CSVHeaderlessTarArchiveReadSuite`) suites so the shared tests run in both modes. Coverage includes multi-entry reads, column pruning, a mixed archive/loose partitioned layout, empty archives, single-partition splittability, `ignoreCorruptFiles`, mismatched headers, custom delimiter, and multiline quoted fields. 
Also: reading an archive without a schema raises `UNABLE_TO_INFER_SCHEMA`, and a corrupt archive among good ones is skipped whole under `ignoreCorruptFiles` (verifying the archive-granular behavior).\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Opus 4.8)\n\nCloses #56193 from akshatshenoi-db/archive-format.\n\nAuthored-by: akshatshenoi-db \u003cakshat.shenoi@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "3ebf8d6b519767fbd55227080cffb23e14810e3d",
      "tree": "ca54e256e64991a271979a835ecda6efc44d3714",
      "parents": [
        "d9c50b2bfb3c7a99753f2d18c835e919e39f86cd"
      ],
      "author": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Mon Jun 08 16:09:07 2026 -0700"
      },
      "committer": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Mon Jun 08 16:09:07 2026 -0700"
      },
      "message": "[SPARK-57320][BUILD] Upgrade Netty to 4.2.15.Final\n\n### What changes were proposed in this pull request?\n\nThis PR aims to upgrade `Netty` to 4.2.15.Final.\n\n### Why are the changes needed?\n\nTo bring the latest bug fixes:\n\n- https://netty.io/news/2026/06/01/4-2-15-Final.html\n  - [CVE-2026-48059](https://github.com/netty/netty/security/advisories/GHSA-h2qv-fj59-j46j): memory exhaustion in io.netty:netty-codec-haproxy (high).\n  - [CVE-2026-47691](https://github.com/netty/netty/security/advisories/GHSA-5pvg-856g-cp85): DNS cache poisoning in io.netty:netty-resolver-dns (high).\n  - [CVE-2026-50560](https://github.com/netty/netty/security/advisories/GHSA-563q-j3cm-6jxm): DDoS in io.netty:netty-codec-http2.\n  - [CVE-2026-50011](https://github.com/netty/netty/security/advisories/GHSA-5w86-c3rq-vjj7): memory exhaustion in io.netty:netty-codec-redis (high).\n  - [CVE-2026-44250](https://github.com/netty/netty/security/advisories/GHSA-3244-j874-rhc2): memory exhaustion in io.netty:netty-codec-redis (high).\n  - [CVE-2026-44890](https://github.com/netty/netty/security/advisories/GHSA-6ghj-frrj-jjj3): memory exhaustion in io.netty:netty-codec-redis (high).\n  - [CVE-2026-50009](https://github.com/netty/netty/security/advisories/GHSA-cq4q-cv5g-r8q5): information disclosure and denial of service in io.netty:netty-codec-classes-quic.\n  - [CVE-2026-44249](https://github.com/netty/netty/security/advisories/GHSA-3qp7-7mw8-wx86): IPv6 subnet filter bypass in io.netty:netty-handler (high).\n  - [CVE-2026-50020](https://github.com/netty/netty/security/advisories/GHSA-hvcg-qmg6-jm4c): request smuggling in io.netty:netty-codec-http.\n  - [CVE-2026-44892](https://github.com/netty/netty/security/advisories/GHSA-c2rx-5r8w-8xr2): memory exhaustion in io.netty:netty-codec-http3 (high).\n  - [CVE-2026-44893](https://github.com/netty/netty/security/advisories/GHSA-cc37-9q2j-3hfv): memory leak in io.netty:netty-codec-haproxy (high).\n  - 
[CVE-2026-44894](https://github.com/netty/netty/security/advisories/GHSA-cmm3-54f8-px4j): traffic amplification in io.netty:netty-codec-classes-quic (high).\n  - [CVE-2026-50010](https://github.com/netty/netty/security/advisories/GHSA-c653-97m9-rcg9): TLS hostname verification accidentally disabled in io.netty:netty-handler (high).\n  - [CVE-2026-45673](https://github.com/netty/netty/security/advisories/GHSA-xmv7-r254-6q78): DNS cache poisoning in io.netty:netty-resolver-dns.\n  - [CVE-2026-45416](https://github.com/netty/netty/security/advisories/GHSA-x4gw-5cx5-pgmh): excessive memory usage from SNIHandler in io.netty:netty-handler (high).\n  - [CVE-2026-45536](https://github.com/netty/netty/security/advisories/GHSA-w573-9ffj-6ff9): file descriptor leak in io.netty:netty-transport-native-epoll and io.netty:netty-transport-native-kqueue.\n  - [CVE-2026-45674](https://github.com/netty/netty/security/advisories/GHSA-676x-f7gg-47vc): DNS cache poisoning in io.netty:netty-resolver-dns (high).\n  - [CVE-2026-46340](https://github.com/netty/netty/security/advisories/GHSA-5xrh-qmmq-w6ch): memory exhaustion in io.netty:netty-transport-sctp (high).\n  - [CVE-2026-47244](https://github.com/netty/netty/security/advisories/GHSA-5x3r-wrvg-rp6q): denial of service in io.netty:netty-codec-http2.\n  - [CVE-2026-48006](https://github.com/netty/netty/security/advisories/GHSA-6jv9-x5w9-2ccm): memory exhaustion in io.netty:netty-codec-redis (high).\n  - [CVE-2026-48748](https://github.com/netty/netty/security/advisories/GHSA-4grm-h2qv-h6w6): memory exhaustion in io.netty:netty-codec-http3 (high).\n  - [CVE-2026-48043](https://github.com/netty/netty/security/advisories/GHSA-c2gf-v879-257j): memory exhaustion in io.netty:netty-codec-http2.\n  - https://github.com/netty/netty/pull/16836\n  - https://github.com/netty/netty/pull/16810\n  - https://github.com/netty/netty/pull/16853\n  - https://github.com/netty/netty/pull/16837\n  - https://github.com/netty/netty/pull/16844\n  - 
https://github.com/netty/netty/pull/16850\n  - https://github.com/netty/netty/pull/16890\n\n- https://netty.io/news/2026/05/20/4-2-14-Final.html\n  - https://github.com/netty/netty/pull/16747\n  - https://github.com/netty/netty/pull/16759\n  - https://github.com/netty/netty/pull/16767\n  - https://github.com/netty/netty/pull/16781\n  - https://github.com/netty/netty/pull/16788\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nPass the CIs.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Opus 4.8\n\nCloses #56373 from dongjoon-hyun/SPARK-57320.\n\nAuthored-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\nSigned-off-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\n"
    },
    {
      "commit": "d9c50b2bfb3c7a99753f2d18c835e919e39f86cd",
      "tree": "c38c1d2b2659895638de7db63df7b67ee00bbbe3",
      "parents": [
        "761afcb676189a4ef58439441da03ba73aed1e21"
      ],
      "author": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Mon Jun 08 16:04:40 2026 -0700"
      },
      "committer": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Mon Jun 08 16:04:40 2026 -0700"
      },
      "message": "[SPARK-57224][INFRA] Add input check for merge script\n\n### What changes were proposed in this pull request?\n\nAdd a utility function `get_input` to the merge script that accepts user input with filters: either a set of acceptable options or a regex.\n\n### Why are the changes needed?\n\nIf the user is prompted with a `y/N` question but types a branch name such as `branch-4.2`, the input is treated the same as `N`, which is not ideal. The prompt should only accept valid answers.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nReviewed by Claude. Did some quick manual testing.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #56285 from gaogaotiantian/merge-script-input.\n\nAuthored-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\nSigned-off-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\n"
    },
    {
      "commit": "761afcb676189a4ef58439441da03ba73aed1e21",
      "tree": "f7caf378acf3a65c371ab54c6509d47dd0993316",
      "parents": [
        "952a283d39a6c90cce20f33812a0b77b5c18c239"
      ],
      "author": {
        "name": "Nikolina Vraneš",
        "email": "nikolina.vranes@databricks.com",
        "time": "Tue Jun 09 06:09:23 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Tue Jun 09 06:09:23 2026 +0800"
      },
      "message": "[SPARK-57133][SQL] Add BIN BY relation operator parsing and resolution\n\n### What changes were proposed in this pull request?\n\nThis is the first PR in a planned series implementing the `BIN BY` relation operator (SPARK-57133). It adds the parser, analyzer, and error classes. Physical execution is intentionally stubbed and lands in a follow-up PR.\n\n`BIN BY` is a relation-level operator (same grammar position as `PIVOT` / `UNPIVOT`) that aligns range-typed rows to fixed-width bin boundaries: it splits any row whose `[range_start, range_end)` crosses a boundary and proportionally redistributes selected numeric or day-time-interval values across the resulting sub-ranges. The target use case is telemetry and observability data, where each row carries its own measurement window (OpenTelemetry, Prometheus exports).\n\nSyntax:\n```sql\nSELECT * FROM relation BIN BY (\n  RANGE rangeStartCol TO rangeEndCol\n  BIN WIDTH widthExpr\n  [ALIGN TO originExpr]\n  DISTRIBUTE UNIFORM (distributeCol [, distributeCol ...])\n  [BIN_START AS aliasName] [BIN_END AS aliasName] [BIN_DISTRIBUTE_RATIO AS aliasName]\n) [AS resultAlias];\n```\n\nWhat this PR adds:\n- Grammar (`SqlBaseLexer.g4`, `SqlBaseParser.g4`): the `binByClause` rule and 7 new non-reserved keywords (`BIN`, `WIDTH`, `ALIGN`, `UNIFORM`, `BIN_START`, `BIN_END`, `BIN_DISTRIBUTE_RATIO`), wired into `relationExtension` and the pipe `operatorPipeRightSide`, with an optional trailing table alias.\n- Logical plans (`basicLogicalOperators.scala`): `UnresolvedBinBy` (parser output) and the resolved `BinBy`, plus the `BinByOutputAliases` helper. 
This follows the two-class `Unpivot` -\u003e `UnpivotTransformer` precedent.\n- AST builder (`AstBuilder.scala`): `withBinBy`, which wraps the node in a `SubqueryAlias` when a trailing alias is present.\n- Analyzer rule (`ResolveBinBy.scala`, wired into `Analyzer.scala`): resolves column references against the child output, validates types and foldability, folds the `BIN WIDTH` and `ALIGN TO` expressions to micros (each guarded so a foldable-but-throwing expression, e.g. an ANSI CAST failure, surfaces as a clean `BIN_BY_*` error rather than `INTERNAL_ERROR`), fills the default origin (session-zone-anchored for `TIMESTAMP`, wall-clock epoch for `TIMESTAMP_NTZ`), captures the session time zone, and builds the output schema. Registered in `RuleIdCollection`; the `BIN_BY` / `UNRESOLVED_BIN_BY` tree patterns are added in `TreePatterns`.\n- Self-join support (`DeduplicateRelations.scala`): `BinBy` is an attribute-producing node, so it is registered in both dedup phases (`renewDuplicatedRelations` and `collectConflictPlans`) to renew the appended attributes\u0027 `ExprId`s for self-joins over a shared `BinBy` subtree, matching the `Generate` / `AttachDistributedSequence` producer pattern.\n- Error classes (`error-conditions.json`, `QueryCompilationErrors.scala`): the 11 `BIN_BY_*` conditions, with analysis-time builders for the 10 raised during resolution, including `BIN_BY_INVALID_ALIGN_TO` for an `ALIGN TO` expression that fails to fold (the runtime `BIN_BY_INVALID_RANGE` is defined here and raised in the execution PR).\n- Execution stub (`SparkStrategies.scala`): the lowering throws `UNSUPPORTED_FEATURE.BIN_BY` until the execution PR lands.\n\nThe output is the input columns plus three appended columns: `bin_start` and `bin_end` (matching the range column type) and `bin_distribute_ratio` (DOUBLE, the fraction of the original range that fell into the bin). 
All three are renameable.\n\n### Why are the changes needed?\n\nTelemetry and observability sources emit rows that each carry their own `[start, end)` measurement window. Re-bucketing such data onto a fixed grid today requires verbose SQL with manual boundary arithmetic, row explosion, and proportional splitting. `BIN BY` expresses this as a single relation operator.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. The operator parses and resolves, but physical execution is intentionally stubbed in this PR (the strategy throws an `UNSUPPORTED_FEATURE` error), so `BIN BY` is not usable end to end yet; execution arrives in a follow-up PR. The 7 new keywords are non-reserved, so existing queries that use them as identifiers continue to parse unchanged.\n\n### How was this patch tested?\n\nNew unit tests, all passing:\n- `PlanParserSuite`: `BIN BY` parsing (minimal and maximal clauses, qualified column references, output renames, trailing alias, and the pipe form), parse-error cases, and confirmation that the new keywords remain usable as identifiers.\n- `ResolveBinBySuite`: resolution against the child output, session-zone capture, default-origin arithmetic (UTC, non-UTC, NTZ), output schema and renames, multipart disambiguation across a join, and every analysis-time error class.\n- `BinBySuite`: end-to-end check that a `BIN BY` query analyzes successfully but its physical execution surfaces `UNSUPPORTED_FEATURE.BIN_BY` (the interim stub) rather than an internal error.\n\n`build/sbt \u0027catalyst/testOnly *ResolveBinBySuite *PlanParserSuite\u0027` reports 107 tests passed.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code\n\nCloses #56247 from vranes/bin-by-parser.\n\nAuthored-by: Nikolina Vraneš \u003cnikolina.vranes@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "952a283d39a6c90cce20f33812a0b77b5c18c239",
      "tree": "b6b1e0d8db7717d50c058c2598421329aaa7c0c7",
      "parents": [
        "99db069b449697e2e24594b6f97dc7a5de60e5fa"
      ],
      "author": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Mon Jun 08 13:48:03 2026 -0700"
      },
      "committer": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Mon Jun 08 13:48:03 2026 -0700"
      },
      "message": "[SPARK-57254][INFRA] Put CI-unrelated files in a module so CI won\u0027t be triggered\n\n### What changes were proposed in this pull request?\n\nPut all the CI-unrelated files in a module called `dev-tools` so they are correctly categorized. They are then no longer considered part of the \"root\" module and no longer trigger a full CI run.\n\nNote that the `lint` workflow is not affected because it does not check changed files.\n\n### Why are the changes needed?\n\nTo reduce CI usage. According to an estimate from Claude Code, about 70 commits in the past year (2% of all commits) could have skipped CI.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nCI.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #56312 from gaogaotiantian/ignore-some-files.\n\nAuthored-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\nSigned-off-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\n"
    },
    {
      "commit": "99db069b449697e2e24594b6f97dc7a5de60e5fa",
      "tree": "7276f1902952c0f62c0b1a6aa28dc4b8bbfbd357",
      "parents": [
        "7129ce08419a52c164c1ca1b4bda15f0eaa8c6f9"
      ],
      "author": {
        "name": "Maxim Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Mon Jun 08 21:02:24 2026 +0200"
      },
      "committer": {
        "name": "Max Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Mon Jun 08 21:02:24 2026 +0200"
      },
      "message": "[SPARK-57317][SQL] Fix Literal.create for external nanosecond timestamp values\n\n### What changes were proposed in this pull request?\n\n`Literal.create(value, dataType)` now routes the value through the schema-driven converter (`CatalystTypeConverters.createToCatalystConverter`) when the declared type contains a nanosecond timestamp type (`TimestampLTZNanosType` / `TimestampNTZNanosType`) anywhere, but only for external values. Values already in Catalyst internal form (`TimestampNanosVal`, `ArrayData`, `MapData`, `InternalRow`) and nulls keep using the lenient schema-less path, preserving the behavior of callers such as `Literal.default` that pass internal values.\n\n### Why are the changes needed?\n\n`Literal.create(value, dataType)` produced an invalid literal when the value was an external (high-level) nanosecond timestamp value (`java.time.Instant` / `java.time.LocalDateTime`, and arrays/maps/structs of them) and the declared type was a nanosecond timestamp type, or a complex type containing one.\n\nFor these types the method routed the value through the schema-less `CatalystTypeConverters.convertToCatalyst`, which by design (SPARK-57033) keeps bare `java.time.Instant` and `java.time.LocalDateTime` on the microsecond converters. 
As a result the produced Catalyst value was a `Long` (epoch micros) instead of the internal `TimestampNanosVal` representation expected by the declared type, and `Literal` validation failed, e.g.:\n\n```\njava.lang.IllegalArgumentException: requirement failed: Literal must have a corresponding value to timestamp_ltz(7), but class Long found.\n```\n\nThe same problem affected collections of such values, e.g.:\n\n```\nLiteral must have a corresponding value to array\u003ctimestamp_ntz(9)\u003e, but class GenericArrayData found.\n```\n\nThis gap was surfaced while adding the nanosecond timestamp types to `DataTypeTestUtils` (SPARK-57259), which drives `PredicateSuite`\u0027s generic \"IN with different types\" coverage over these types.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. Both nanosecond timestamp types are `Unstable` and unreleased; previously these `Literal.create` calls threw, so this only enables a path that did not work before.\n\n### How was this patch tested?\n\nAdded a unit test in `LiteralExpressionSuite` (\"SPARK-57317: create literals from external nanosecond timestamp values\") covering scalar, array, and struct nanosecond timestamp values, plus already-internal and null inputs.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Cursor (Claude Opus 4.8)\n\nCloses #56371 from MaxGekk/spark-57317-literal-nanos.\n\nAuthored-by: Maxim Gekk \u003cmax.gekk@gmail.com\u003e\nSigned-off-by: Max Gekk \u003cmax.gekk@gmail.com\u003e\n"
    },
    {
      "commit": "7129ce08419a52c164c1ca1b4bda15f0eaa8c6f9",
      "tree": "d8a34281cbf7b67d9989e4ed6883a3de5b8cef81",
      "parents": [
        "b55c2cc1a66a0bcf4a2857df15a59f5edfc40282"
      ],
      "author": {
        "name": "Adam Binford",
        "email": "adamq43@gmail.com",
        "time": "Mon Jun 08 10:50:21 2026 -0700"
      },
      "committer": {
        "name": "Chao Sun",
        "email": "chao@openai.com",
        "time": "Mon Jun 08 10:50:21 2026 -0700"
      },
      "message": "[SPARK-37019][SQL] Add codegen support to array higher-order functions\n\n### What changes were proposed in this pull request?\n\nThis PR adds codegen support to array-based higher-order functions except ArraySort. This is my first time playing around with codegen, so definitely looking for any feedback.\n\nA few notes:\n- Disabled subexpression elimination for lambda functions (this already was the case because it was CodegenFallback). I plan to explore supporting subexpression elimination inside lambda functions later on, as it will require special handling.\n- I set the AtomicReference for all lambda values as well in case a child expression reverts to interpreted evaluation for any reason (CodegenFallback or otherwise)\n\n### Why are the changes needed?\n\nTo improve performance of array higher-order function operations, letting the children be codegen\u0027d and participate in WholeStageCodegen\n\n### Does this PR introduce _any_ user-facing change?\n\nNo, only performance improvements.\n\n### How was this patch tested?\n\nExisting unit tests, let me know if there are other codegen-specific unit tests I should add.\n\nCloses #34558 from Kimahriman/array-hof-codegen.\n\nAuthored-by: Adam Binford \u003cadamq43@gmail.com\u003e\nSigned-off-by: Chao Sun \u003cchao@openai.com\u003e\n"
    },
    {
      "commit": "b55c2cc1a66a0bcf4a2857df15a59f5edfc40282",
      "tree": "efee443ffc7f0aa29d4636cc722ce7b389803989",
      "parents": [
        "542ea3b60926b118775c39bcf5755f2c02fa43ad"
      ],
      "author": {
        "name": "Chao Sun",
        "email": "chao@openai.com",
        "time": "Mon Jun 08 10:20:23 2026 -0700"
      },
      "committer": {
        "name": "Chao Sun",
        "email": "chao@openai.com",
        "time": "Mon Jun 08 10:20:23 2026 -0700"
      },
      "message": "[SPARK-57282][SQL] Spread NULL left anti join keys across shuffle partitions\n\n### What changes were proposed in this pull request?\n\nExtend `spark.sql.shuffle.spreadNullJoinKeys.enabled` to shuffled `LEFT ANTI`\nequi-joins when the preserved left-side join keys are nullable.\n\nThe planner requests the existing null-aware clustered distribution for eligible\nleft anti joins. Non-NULL keys retain normal hash placement, while NULL keys may\nbe spread across shuffle partitions. This PR also updates the configuration\ndocumentation.\n\nThe tests cover sort-merge and shuffled-hash left anti joins, including result\ncorrectness and null-aware shuffle partitioning, plus AQE coalescing of the\nresulting partitioning.\n\nThis follows the `LEFT ANTI` discussion in\nhttps://github.com/apache/spark/pull/55927.\n\n### Why are the changes needed?\n\nFor an ordinary `LEFT ANTI` equi-join, rows with NULL keys on the preserved left\nside cannot match and must be emitted. Standard hash partitioning sends all of\nthose rows to the same reducer, which can create severe shuffle skew.\n\nSpreading the NULL-keyed rows only changes their physical placement and\ntherefore reduces this skew without changing join results.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes, but only when `spark.sql.shuffle.spreadNullJoinKeys.enabled` is enabled.\nEligible shuffled left anti joins may spread NULL-keyed preserved rows across\nshuffle partitions. 
Query results are unchanged, and the configuration remains\ndisabled by default.\n\n### How was this patch tested?\n\n- `JAVA_HOME\u003d/opt/homebrew/opt/openjdk17/libexec/openjdk.jdk/Contents/Home ./build/sbt \"sql/testOnly org.apache.spark.sql.execution.joins.ExistenceJoinSuite\"` (120 tests passed)\n- `JAVA_HOME\u003d/opt/homebrew/opt/openjdk17/libexec/openjdk.jdk/Contents/Home ./build/sbt \"sql/testOnly org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- -z \u0027SPARK-57282: spread NULL keys for left anti join\u0027\"` (1 test passed)\n- `JAVA_HOME\u003d/opt/homebrew/opt/openjdk17/libexec/openjdk.jdk/Contents/Home ./dev/lint-scala`\n- `git diff --check`\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Codex GPT-5\n\nCloses #56348 from sunchao/dev/chao/codex/spread-null-left-anti-oss.\n\nAuthored-by: Chao Sun \u003cchao@openai.com\u003e\nSigned-off-by: Chao Sun \u003cchao@openai.com\u003e\n"
    },
    {
      "commit": "542ea3b60926b118775c39bcf5755f2c02fa43ad",
      "tree": "a2b9da739e9a2b7bc7c2e1929783d4970d765416",
      "parents": [
        "1592ec2de013d759f28f03b40fcb5bba82c89ac6"
      ],
      "author": {
        "name": "Andreas Chatzistergiou",
        "email": "andreas.chatzistergiou@databricks.com",
        "time": "Mon Jun 08 09:08:23 2026 -0700"
      },
      "committer": {
        "name": "Gengliang Wang",
        "email": "gengliang@apache.org",
        "time": "Mon Jun 08 09:08:23 2026 -0700"
      },
      "message": "[SPARK-56995][SQL][DML] Allow dataframe caching in the DSv2 Transaction API\n\nCurrently, the DSv2 Transaction API skips dataframe caching. This can cause a significant performance regression in existing workloads. Dataframes cached prior to the transaction should be allowed to be reused within the transaction. Cache substitution during a transaction now delegates to the connector via `Transaction.registerScans`. Spark hands every materialized scan in a candidate cached subtree to the active transaction, and the connector decides whether reusing the cached snapshots is compatible with its isolation contract.\n\nFurthermore, this PR fixes an issue where the `v2TableReference` to Relation mechanism did not take subqueries into account. With the fix, all pre-resolved relations in the plan (including subqueries) are un-resolved to `v2TableReference` and then resolved again at `ResolveRelations`.\n\n### What changes were proposed in this pull request?\n\nThis PR introduces `Transaction.registerScans` in the transaction API. See above for more details.\n\n### Why are the changes needed?\n\nWithout this fix, cached dataframes cannot be used in transactions. As a result, the DSv2 transaction API would introduce a significant performance regression for relevant workloads.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nExisting and new tests.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nOpus 4.7\n\nCloses #56121 from andreaschat-db/dsv2TransactionDFCachingFix.\n\nAuthored-by: Andreas Chatzistergiou \u003candreas.chatzistergiou@databricks.com\u003e\nSigned-off-by: Gengliang Wang \u003cgengliang@apache.org\u003e\n"
    },
    {
      "commit": "1592ec2de013d759f28f03b40fcb5bba82c89ac6",
      "tree": "8ad0c81927044c73ca9bb0ac35d862499977ae11",
      "parents": [
        "b098a58a3d6ebd750d009905c4d2bf9b0e457c61"
      ],
      "author": {
        "name": "Sven Weber",
        "email": "sven.weber@databricks.com",
        "time": "Mon Jun 08 09:59:12 2026 -0400"
      },
      "committer": {
        "name": "Herman van Hövell",
        "email": "herman@databricks.com",
        "time": "Mon Jun 08 09:59:12 2026 -0400"
      },
      "message": "[SPARK-56661] Addressing review comments from PR #55768\n\n### What changes were proposed in this pull request?\n\nThis change addresses the open review comments from [pull request #55768](https://github.com/apache/spark/pull/55768). In summary, the following changes were made:\n\n  Correctness fixes (2)\n\n  1. SparkEnv.scala:161 — The created UDFDispatcherManager is now stored back into udfDispatcherManager \u003d\n  Some(created), so the cache works and stop() properly closes dispatchers.\n  2. ExternalUDFPlanner.scala + logical/physical nodes + strategy — profile: Option[ResourceProfile] is now\n  threaded through MapPartitionsExternalUDF (logical), MapPartitionsExternalUDFExec (physical),\n  SparkStrategies, and UnifiedExternalUDFPlanner, so a ResourceProfile passed to mapInPandas is no longer\n  silently dropped.\n\n  Nits (5)\n\n  1. ExternalUDFExec.scala:65 — Reworded \"CAN but MUST NOT cancel or close\" → \"may use the session but MUST\n  NOT cancel or close it\"\n  2. MapPartitionsExternalUDFExec.scala:39-42 — Fixed stale Scaladoc: functionExpr → function, removed\n  non-existent resultAttributes\n  3. UDFDispatcherManager.scala:33 — [[stop]] → [[close]]\n  4. UDFDispatcherManager.scala:49 — \"write lock is used by stop\" → \"close\"\n  5. UDFDispatcherManager.scala:66 — Removed duplicate \"quick path\" comment\n\n### Why are the changes needed?\n\nThey were requested in a review of the above linked PR.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo\n\n### How was this patch tested?\n\nLinting \u0026 re-running existing tests. Mostly nit changes.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nYes\n\nCloses #56367 from sven-weber-db/sven-weber_data/SPARK-56661.\n\nAuthored-by: Sven Weber \u003csven.weber@databricks.com\u003e\nSigned-off-by: Herman van Hövell \u003cherman@databricks.com\u003e\n"
    },
    {
      "commit": "b098a58a3d6ebd750d009905c4d2bf9b0e457c61",
      "tree": "25e569bcc4d5e951a6edd918a770751ab60899a6",
      "parents": [
        "3e022571b3af068e42ba7e57794ef8bb1fbfca53"
      ],
      "author": {
        "name": "YangJie",
        "email": "yangjie01@baidu.com",
        "time": "Mon Jun 08 20:51:37 2026 +0800"
      },
      "committer": {
        "name": "yangjie01",
        "email": "yangjie01@baidu.com",
        "time": "Mon Jun 08 20:51:37 2026 +0800"
      },
      "message": "[SPARK-57258][SQL] Reduce regexp_extract/regexp_extract_all generated code size via shared extract helpers\n\n### What changes were proposed in this pull request?\n\nThis is a sub-task of [SPARK-56908](https://issues.apache.org/jira/browse/SPARK-56908) (reduce the size of generated Java code in whole-stage codegen).\n\n`RegExpExtract` and `RegExpExtractAll` inline the entire match-result-extraction logic into the generated Java produced by `doGenCode` (a `find()` / `toMatchResult()` / `checkGroupIndex` / `group(idx)` block, plus a `while` loop and an `ArrayList` accumulation for the `*All` variant). This duplicates the same logic that already exists in their `nullSafeEval` interpreted path and emits a large block into every generated class that uses these functions.\n\nThis PR extracts that logic into two shared helpers on the existing `object RegExpExtractBase` (placed next to `checkGroupIndex`, which the generated code already calls the same way):\n\n- `RegExpExtractBase.extract(matcher, idx, prettyName): UTF8String`\n- `RegExpExtractBase.extractAll(matcher, idx, prettyName): GenericArrayData`\n\nBoth `nullSafeEval` and `doGenCode` now call these helpers, so the generated Java is a single method call instead of an inline block. This mirrors the approach already used by `RegExpReplace` (`RegExpUtils.replace`, SPARK-57255 / #56315), reusing `RegExpExtractBase` here because `checkGroupIndex` is already co-located there.\n\n`RegExpInStr` (a third `RegExpExtractBase` subclass) is intentionally left unchanged: it returns the match start position rather than an extracted group, so these helpers do not apply.\n\nThe unused `java.util.regex.MatchResult` import and the now-dead codegen locals (`matchResult`, `matchResults`, `arrayClass`) are removed.\n\n### Why are the changes needed?\n\nSmaller generated methods reduce JIT/Janino pressure and the risk of hitting the 64KB method limit in wide whole-stage-codegen stages. 
Measured with `debugCodegen()` on a single-expression stage (`spark.range(1000).selectExpr(...)`):\n\n| Plan | `maxMethodCodeSize` | `maxConstantPoolSize` |\n|---|---|---|\n| `regexp_extract(cast(id as string), \u0027([0-9]+)\u0027, 1)` | 415 -\u003e 357 (-14.0%) | 260 -\u003e 239 (-8.1%) |\n| `regexp_extract_all(cast(id as string), \u0027([0-9])\u0027, 1)` | 569 -\u003e 477 (-16.2%) | 320 -\u003e 285 (-10.9%) |\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. This is a behavior-preserving refactor. The interpreted and codegen paths produce identical results, including the `INVALID_PARAMETER_VALUE.REGEX_GROUP_INDEX` error contract (the group-index check still runs only after a successful match, so a non-matching input never throws).\n\n### How was this patch tested?\n\nExisting `RegexpExpressionsSuite` tests for `RegExpExtract` and `RegExpExtractAll` pass (they exercise both interpreted and codegen via `checkEvaluation`, including the `REGEX_GROUP_INDEX` error path), and scalastyle is clean. No new test is needed because the refactor preserves behavior and the existing tests already cover both paths.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code\n\nCloses #56318 from LuciferYang/regexpextract-codegen-helper.\n\nAuthored-by: YangJie \u003cyangjie01@baidu.com\u003e\nSigned-off-by: yangjie01 \u003cyangjie01@baidu.com\u003e\n"
    },
    {
      "commit": "3e022571b3af068e42ba7e57794ef8bb1fbfca53",
      "tree": "6aa4445f59dd52777f73fcc8b82451bfa2cf64f4",
      "parents": [
        "e8ca2874188266191bbc83dec6bdfe81984a7d41"
      ],
      "author": {
        "name": "tonghuaroot (童话)",
        "email": "tonghuaroot@gmail.com",
        "time": "Mon Jun 08 19:00:06 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@foxmail.com",
        "time": "Mon Jun 08 19:00:06 2026 +0800"
      },
      "message": "[SPARK-57294][PS] Support DataFrame.combine in fallback mode\n\n### What changes were proposed in this pull request?\n\nThis PR enables `DataFrame.combine` for pandas-on-Spark through the\n`compute.pandas_fallback` path. Previously `combine` was declared via\n`_unsupported_function`, so it always raised `PandasNotImplementedError`.\nThis PR adds a `_combine_fallback` method to\n`pyspark.pandas.frame.DataFrame`, mirroring the existing\n`_asof_fallback` / `_set_axis_fallback` sibling methods, so that\n`__getattr__` dispatches `combine` through the generic\n`_build_fallback_method` when the fallback option is enabled.\n\nIt also adds tests covering both the disabled (raises\n`PandasNotImplementedError`) and the fallback-enabled behavior, plus the\nSpark Connect parity test, and registers them in\n`dev/sparktestsupport/modules.py`.\n\nJIRA: https://issues.apache.org/jira/browse/SPARK-57294\n\n### Why are the changes needed?\n\n`combine` is a useful pandas DataFrame API that was unsupported on\npandas-on-Spark even when users opted into `compute.pandas_fallback`.\nIt is a sound fallback candidate for the same reasons as the existing\nasof / set_axis fallbacks: its result is an ordinary single-level-index\nDataFrame whose column dtypes (for example int64) map cleanly onto Spark\ntypes, so the generic fallback round-trip through\n`ps.from_pandas` / `as_spark_type` succeeds. Wiring it through fallback\ncloses a gap in the pandas-on-Spark fallback coverage and gives users an\nexplicit, opt-in way to run `combine`.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes. With `compute.pandas_fallback` enabled, calling\n`DataFrame.combine` on a pandas-on-Spark DataFrame now executes via the\npandas fallback path and returns a result instead of raising\n`PandasNotImplementedError`. A `PandasAPIOnSparkAdviceWarning` is emitted\nto indicate the call ran in fallback mode. 
When the option is disabled\n(the default), the behavior is unchanged and `PandasNotImplementedError`\nis still raised.\n\n### How was this patch tested?\n\nAdded `pyspark.pandas.tests.frame.test_combine` and the Spark Connect\nparity test `pyspark.pandas.tests.connect.frame.test_parity_combine`,\nboth registered in `dev/sparktestsupport/modules.py`. The classic test\ncovers two cases:\n\n- `test_disabled`: without `compute.pandas_fallback`, `combine` raises\n  `PandasNotImplementedError`.\n- `test_fallback`: with the option enabled, `combine` (including the\n  `overwrite\u003dFalse` case) produces results equal to pandas, asserted\n  with `assert_eq` (values and dtypes).\n\nRan `test_combine` against a real local SparkSession:\n\n```\n$ python -m pytest python/pyspark/pandas/tests/frame/test_combine.py -v\ncollected 4 items\n... test_assert_classic_mode PASSED\n... CombineTests::test_assert_classic_mode PASSED\n... CombineTests::test_disabled PASSED\n... CombineTests::test_fallback PASSED\n4 passed in 11.32s\n```\n\nEnvironment: PySpark master (based on commit c082f824), pandas 2.2.3,\nPyArrow as bundled, OpenJDK 17.0.18, Python 3.11. The\n`PandasAPIOnSparkAdviceWarning: combine is executed in fallback mode`\nmessage confirms the call exercised the fallback path.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nYes, this patch was co-authored with generative AI tooling (Claude,\nAnthropic Opus 4.8). The contributor directed the change: choosing\n`combine` as the target, requiring that any fallback candidate first be\nverified to round-trip through the Spark type system before being\nproposed (which ruled out candidates such as `to_period` and\n`tz_localize`, whose result dtypes have no Spark mapping), and reviewing\nthe implementation and the test results. 
The AI tooling assisted with\ndrafting the implementation and the tests.\n\nCloses #56359 from tonghuaroot/pyspark-combine-fallback.\n\nAuthored-by: tonghuaroot (童话) \u003ctonghuaroot@gmail.com\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@foxmail.com\u003e\n"
    },
    {
      "commit": "e8ca2874188266191bbc83dec6bdfe81984a7d41",
      "tree": "5c06d17c8d3b2882b0b8f8646368a6970d0a8a3e",
      "parents": [
        "4d8b715523c15d4d4397063675a8fc59f86c1fe3"
      ],
      "author": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Mon Jun 08 17:44:23 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@foxmail.com",
        "time": "Mon Jun 08 17:44:23 2026 +0800"
      },
      "message": "[SPARK-56830][INFRA] Share SBT compile artifact with python hosted runner CI jobs\n\n### What changes were proposed in this pull request?\n\nThis PR extends the SBT precompile-sharing pattern (parent: [SPARK-56830](https://issues.apache.org/jira/browse/SPARK-56830), pyspark: [SPARK-56768](https://issues.apache.org/jira/browse/SPARK-56768)) to the python-only macOS / ARM workflows that run via `.github/workflows/python_hosted_runner_test.yml`.\n\nConcretely:\n\n- New `precompile` job in `python_hosted_runner_test.yml` runs Spark\u0027s SBT build once on `${{ inputs.os }}`:\n  ```\n  ./build/sbt -Phadoop-3 -Pyarn -Pspark-ganglia-lgpl -Phadoop-cloud -Phive \\\n    -Pkubernetes -Pjvm-profiler -Pkinesis-asl -Phive-thriftserver \\\n    -Pdocker-integration-tests -Pvolcano \\\n    Test/package streaming-kinesis-asl-assembly/assembly connect/assembly assembly/package\n  ```\n  It tars every `target/` directory (excluding `./build/` and `./.git/`) with `tar -czf`, uploads as `spark-compile-\u003cos\u003e-\u003cbranch\u003e-\u003crun_id\u003e` with `retention-days: 1`.\n- The 9 pyspark matrix entries in the same workflow add `precompile` to `needs:` and `if: (!cancelled())`, download/extract the artifact (with graceful fallback), and export `SKIP_SCALA_BUILD\u003dtrue` so `dev/run-tests.py` skips `build_apache_spark` and `build_spark_assembly_sbt`.\n- Cache steps in the new precompile job are gated `if: ${{ runner.os !\u003d \u0027macOS\u0027 }}` to match the existing TODO(SPARK-54466) pattern in this file: on `macos-26` the precompile runs without GHA cache; on `ubuntu-24.04-arm` it caches as expected.\n- Artifact name includes `${{ inputs.os }}` so the two callers (`build_python_3.12_macos26.yml` and `build_python_3.12_arm.yml`) cannot collide.\n\nThis benefits both callers of the reusable workflow:\n- `.github/workflows/build_python_3.12_macos26.yml` (macos-26)\n- `.github/workflows/build_python_3.12_arm.yml` (ubuntu-24.04-arm)\n\n### Optional: graceful 
fallback if precompile fails\n\nSame pattern as SPARK-56768:\n- `precompile` has `continue-on-error: true` so a failed or cancelled precompile does not fail the workflow run.\n- The matrix\u0027s \"Download precompiled artifact\" step is gated on `needs.precompile.result \u003d\u003d \u0027success\u0027` and itself has `continue-on-error: true`.\n- The \"Extract precompiled artifact\" step is gated on the download succeeding, and also has `continue-on-error: true`.\n- Inside the \"Run tests\" bash block, `SKIP_SCALA_BUILD\u003dtrue` is exported only when `steps.extract-precompiled.outcome \u003d\u003d \u0027success\u0027`. Otherwise it stays unset and `dev/run-tests.py` falls back to the original local SBT build.\n\n### Why are the changes needed?\n\nToday every one of the 9 pyspark matrix entries in `python_hosted_runner_test.yml` runs the same SBT build from scratch. Sharing the compile artifact once across the matrix avoids 8x duplicate SBT compile work per scheduled run of `build_python_3.12_macos26.yml` (and `build_python_3.12_arm.yml`). This mirrors the savings already realized for the Linux pyspark matrix in SPARK-56768.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. CI infrastructure change only.\n\n### How was this patch tested?\n\nThe change is exercised by the CI run of this PR itself. To validate the `macos-26` path specifically, a temporary PR-builder hook ran `python_hosted_runner_test.yml` on `macos-26` (dropped before merge); all 9 `Python on macOS` pyspark matrix entries passed while reusing the shared precompiled artifact:\n\nhttps://github.com/zhengruifeng/spark/actions/runs/27010550379\n\nIf the precompile job is forced to fail (or its artifact is missing), the matrix entries should still pass via the fallback path. 
The \"Run tests\" step logs `Reusing precompiled artifact, skipping local SBT build.` to make the fast path visible per matrix entry.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Opus 4.7)\n\nCloses #56107 from zhengruifeng/share-sbt-compile-python-macos-dev5.\n\nAuthored-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@foxmail.com\u003e\n"
    },
    {
      "commit": "4d8b715523c15d4d4397063675a8fc59f86c1fe3",
      "tree": "e1cea11aaa498116e0ed370c0063171b21e956f0",
      "parents": [
        "9c1adaf8aba2fb507e32a318f4018e58d56861df"
      ],
      "author": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Mon Jun 08 17:17:27 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@foxmail.com",
        "time": "Mon Jun 08 17:17:27 2026 +0800"
      },
      "message": "[SPARK-57277][INFRA] Make CI cache keys OS-specific\n\n### What changes were proposed in this pull request?\n\nMake every CI cache `key:` and `restore-keys:` value OS-specific and give them a uniform shape: `\u003cprefix\u003e-${{ runner.os }}-\u003cenv\u003e-\u003chash\u003e` — the human-readable prefix leads, `${{ runner.os }}` follows, then any environment-specific component (e.g. the Java version, or Java + Hadoop for Coursier), then the `hashFiles(...)` hash.\n\n- `build_and_test.yml`: `build-`, `docs-maven-`, `tpcds-`, and the Coursier caches (`coursier-${{ runner.os }}-…`, including the matrix variant `coursier-${{ runner.os }}-${{ matrix.java }}-${{ matrix.hadoop }}-…`)\n- `build_sparkr_window.yml`: `build-sparkr-windows-maven-`\n- `maven_test.yml`: `build-`, and `maven-${{ runner.os }}-java${{ ... }}-` (Java version after the OS)\n- `build_python_connect.yml` / `build_python_connect40.yml`: `build-spark-connect-python-only-`, `coursier-build-spark-connect-python-only-`\n- `python_hosted_runner_test.yml`: `build-`, `coursier-`\n- `publish_snapshot.yml`: `snapshot-maven-`\n\nThe Coursier caches already embedded `${{ runner.os }}` (as a leading prefix); they are reordered here so that every cache key in the workflows follows the same `\u003cprefix\u003e-${{ runner.os }}-…` convention.\n\n### Why are the changes needed?\n\nWithout an OS component, cache entries from Linux, macOS, and Windows runners share the same key namespace. A cache written by one OS can be restored on another, causing subtle build failures or stale-artifact issues. Embedding `${{ runner.os }}` ensures each OS has its own isolated cache. Keeping the descriptive prefix first preserves readability, groups related caches together, and makes the keys consistent across all workflows.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nCI workflow change only; no code logic changed. 
The cache keys are verified by inspection.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Opus 4.8\n\nCloses #56342 from zhengruifeng/ci-cache-key-runner-os-dev7.\n\nAuthored-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@foxmail.com\u003e\n"
    },
    {
      "commit": "9c1adaf8aba2fb507e32a318f4018e58d56861df",
      "tree": "3ceeacfaaf25ddb07b965f977f8106c333b7e0d1",
      "parents": [
        "96b255f16c6240cc2cdf7a8f747e9f00983f537c"
      ],
      "author": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Mon Jun 08 17:15:14 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@foxmail.com",
        "time": "Mon Jun 08 17:15:14 2026 +0800"
      },
      "message": "[SPARK-57278][INFRA] Install zstd in CI container images to fix GitHub Actions cache\n\n### What changes were proposed in this pull request?\n\nInstall `zstd` in all CI container image Dockerfiles (`dev/infra/Dockerfile` and the `python-*`, `docs`, `lint`, `sparkr` images under `dev/spark-test-image/`).\n\n### Why are the changes needed?\n\n`actions/cache` has never successfully restored a cache in any container-based CI job — confirmed by `apache/spark`\u0027s cache history, which has no `pyspark-coursier-*` / `sparkr-coursier-*` / `docs-coursier-*` entry. This is a long-standing issue, present since container jobs were introduced.\n\n`actions/cache` computes a cache **version** \u003d `SHA256(paths + compression_method)` and includes it in the lookup URL. Host runners have `zstd` and use it; container images lack `zstd` and fall back to `gzip`. The version then differs, so caches saved by host jobs (e.g. the Coursier cache written by `precompile`) are invisible to container jobs even when the key matches. Installing `zstd` aligns the compression method.\n\n(The `build-` cache happened to work because it is written by both host and container jobs, so a gzip-version entry also existed; host-only caches had no gzip entry.)\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. 
CI-only.\n\n### How was this patch tested?\n\nBefore (run [26956300346](https://github.com/zhengruifeng/spark/actions/runs/26956300346)): `precompile` saved `Linux-coursier-\u003chash\u003e`, but all container jobs reported `Cache not found` for the same key minutes later.\n\nAfter (run [26996424034](https://github.com/zhengruifeng/spark/actions/runs/26996424034)): `pyspark`, `sparkr`, `lint`, `docs` all `Cache restored from key: Linux-coursier-\u003chash\u003e`.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (claude-sonnet-4-6)\n\nCloses #56324 from zhengruifeng/fix-container-coursier-cache-ci-cache-opt-dev6.\n\nAuthored-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@foxmail.com\u003e\n"
    },
    {
      "commit": "96b255f16c6240cc2cdf7a8f747e9f00983f537c",
      "tree": "1c4ae8adb85b566699b36e93c524b394f3d64408",
      "parents": [
        "2660f4d13a787404bd728fbf47a597df205d0743"
      ],
      "author": {
        "name": "Kousuke Saruta",
        "email": "sarutak@amazon.co.jp",
        "time": "Mon Jun 08 14:28:55 2026 +0900"
      },
      "committer": {
        "name": "Kousuke Saruta",
        "email": "sarutak@apache.org",
        "time": "Mon Jun 08 14:28:55 2026 +0900"
      },
      "message": "[SPARK-57262][SQL][WEBUI] Job description derived from a query should respect `spark.sql.redaction.string.regex`\n\n### What changes were proposed in this pull request?\nThis PR changes `SparkSQLDriver.scala` to redact a query before `setJobDescription`.\n\n### Why are the changes needed?\nIn the current implementation, when a query is executed through `SparkSQLDriver`, redaction is done in `SQLExecution.scala` so the description in the table on the top of `/SQL/execution` is redacted.\n\u003cimg width\u003d\"1083\" height\u003d\"349\" alt\u003d\"sql-execution-page-top-table\" src\u003d\"https://github.com/user-attachments/assets/b06fb255-2b46-473d-9046-1b2d578e3bda\" /\u003e\n\nBut the description in the table on the `/jobs` page and the one in the table on the bottom of `/SQL/execution` page are not redacted.\n\u003cimg width\u003d\"525\" height\u003d\"692\" alt\u003d\"jobs-page-before\" src\u003d\"https://github.com/user-attachments/assets/31c88b98-779b-4305-bf71-58f19a1d7117\" /\u003e\n\u003cimg width\u003d\"515\" height\u003d\"274\" alt\u003d\"sql-execution-page-before\" src\u003d\"https://github.com/user-attachments/assets/012be251-f642-4ded-8f77-32f811b05cac\" /\u003e\n\nNOTE:\nEven after this PR is merged, when a job description is set manually using `sc.setJobDescription`, the description displayed in the `/jobs` page and the one on the bottom of `/SQL/execution` page are not redacted though the one on the top of `SQL/execution` page is redacted.\n\n```\n$ bin/spark-shell -c spark.sql.redaction.string.regex\u003d\"secret.*\u003d.*\"\nscala\u003e val s \u003d \"SELECT * FROM (SELECT \u0027secret\u003d1\u0027)\"\nscala\u003e sc.setJobDescription(s)\nscala\u003e sql(s).show()\n+--------+\n|secret\u003d1|\n+--------+\n|secret\u003d1|\n+--------+\n```\n\n**description in `/jobs` page**\n\u003cimg width\u003d\"555\" height\u003d\"226\" alt\u003d\"jobs-page-not-redacted\" 
src\u003d\"https://github.com/user-attachments/assets/b4e084ad-b648-4ba6-b049-ef42f570398d\" /\u003e\n**description in `/SQL/execution` (top)**\n\u003cimg width\u003d\"913\" height\u003d\"203\" alt\u003d\"sql-execution-page-redacted\" src\u003d\"https://github.com/user-attachments/assets/91e745f0-aa7f-4618-98e9-5b4b117415da\" /\u003e\n**description in `/SQL/execution` (bottom)**\n\u003cimg width\u003d\"536\" height\u003d\"292\" alt\u003d\"sql-execution-page-not-redacted\" src\u003d\"https://github.com/user-attachments/assets/761aad76-0d1b-49af-9e03-58510cd474d1\" /\u003e\n\nThis is consistent with the previous behavior and not a regression. There is no simple way to redact them and doing it is out of scope of this PR.\n\n### Does this PR introduce _any_ user-facing change?\nYes.\n\n### How was this patch tested?\nAdded new test and confirmed the test `SQL execution description should respect spark.sql.redaction.string.regex` added in #56358 passed.\nAlso confirmed descriptions are redacted in UI.\n```\n$ bin/spark-sql --conf spark.sql.redaction.string.regex\u003d\"secret.*\u003d.*\"\nspark-sql (default)\u003e  CREATE TABLE test1(secret string);\nspark-sql (default)\u003e SELECT * FROM test1 WHERE secret\u003d1;\n```\n\u003cimg width\u003d\"607\" height\u003d\"213\" alt\u003d\"jobs-page-after-2\" src\u003d\"https://github.com/user-attachments/assets/62646cfc-67c3-46b5-a9f9-695b1f874462\" /\u003e\n\u003cimg width\u003d\"589\" height\u003d\"274\" alt\u003d\"sql-execution-page-after-2\" src\u003d\"https://github.com/user-attachments/assets/597db0da-58fb-4275-b6aa-7e8b301f15d0\" /\u003e\n\n### Was this patch authored or co-authored using generative AI tooling?\nKiro CLI / Claude\n\nCloses #56361 from sarutak/fix-redact-sql-description-v2.\n\nAuthored-by: Kousuke Saruta \u003csarutak@amazon.co.jp\u003e\nSigned-off-by: Kousuke Saruta \u003csarutak@apache.org\u003e\n"
    },
    {
      "commit": "2660f4d13a787404bd728fbf47a597df205d0743",
      "tree": "0183c036485a7cb7d8b2b0318bb2d30c657269c4",
      "parents": [
        "5077f7f12d7660532e9f1e76570b912fabc3b8af"
      ],
      "author": {
        "name": "Maxim Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Mon Jun 08 07:21:19 2026 +0200"
      },
      "committer": {
        "name": "Max Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Mon Jun 08 07:21:19 2026 +0200"
      },
      "message": "[SPARK-57293][SQL] Cast between nanosecond-precision and microsecond-precision timestamp types\n\n### What changes were proposed in this pull request?\n\nThis PR adds explicit casts between nanosecond-precision timestamp types and their microsecond-precision counterparts, in both the interpreted and codegen paths:\n\n- `TIMESTAMP_NTZ` \u003c-\u003e `TIMESTAMP_NTZ(p)`\n- `TIMESTAMP_LTZ` \u003c-\u003e `TIMESTAMP_LTZ(p)` (`p` in `[7, 9]`)\n\nBoth directions stay within a single zone family, so they are pure representation conversions with no timezone involvement:\n\n- **Widening** (micros -\u003e nanos): `TimestampNanosVal.fromParts(micros, 0)`. Lossless and independent of the target precision `p` (the sub-microsecond part is always 0).\n- **Narrowing** (nanos -\u003e micros): takes `epochMicros`, dropping the sub-microsecond digits. Truncation toward the past (floor), consistent with how microsecond timestamps are already produced. Silent in both ANSI and non-ANSI modes, matching Spark\u0027s existing silent fractional-second truncation for timestamp casts.\n\nImplementation:\n- Registered the four pairs in `Cast.canCast` / `Cast.canAnsiCast`.\n- Added interpreted cases in `castToTimestamp` / `castToTimestampNTZ` (narrowing) and `castToTimestampLTZNanos` / `castToTimestampNTZNanos` (widening), and mirrored them in the corresponding codegen helpers.\n- No new `Cast.needsTimeZone` entries are required. The preview flag `spark.sql.timestampNanosTypes.enabled` continues to gate the nanosecond-typed side.\n\nOut of scope: precision-to-precision casts within the nanosecond family (`TIMESTAMP_NTZ(p1)` -\u003e `TIMESTAMP_NTZ(p2)`), cross-family casts (`TIMESTAMP_LTZ(p)` \u003c-\u003e `TIMESTAMP_NTZ(p)`), and implicit/up-cast/store-assignment coercion. 
These casts remain explicit-only, consistent with the existing string\u003c-\u003enanos casts.\n\nThis is a subtask of [SPARK-56822](https://issues.apache.org/jira/browse/SPARK-56822) (SPIP: Timestamps with nanosecond precision).\n\n### Why are the changes needed?\n\nNanosecond-precision timestamp types currently support parsing from strings (SPARK-57211) and rendering to strings (SPARK-57256), but there is no cast between a nanosecond-precision type and its microsecond-precision counterpart. As a result, values cannot move between `TIMESTAMP_NTZ(p)` and `TIMESTAMP_NTZ`, or between `TIMESTAMP_LTZ(p)` and `TIMESTAMP_LTZ`. This PR fills that gap.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes, but only behind the preview flag `spark.sql.timestampNanosTypes.enabled`. When enabled, users can now explicitly cast between the four new pairs, e.g.:\n\n```sql\nSELECT cast(cast(\u00272020-01-01 00:00:00.123456789\u0027 as timestamp_ntz(9)) as timestamp_ntz);\n-- 2020-01-01 00:00:00.123456\n```\n\nThe nanosecond-precision types are an unreleased, preview feature, so this is not a change compared to any released Spark version.\n\n### How was this patch tested?\n\n- Added unit tests in `CastSuiteBase` covering widening, narrowing/truncation (with a non-zero sub-microsecond part to prove flooring), round-trip, and null inputs for both NTZ and LTZ families. 
These run under `CastWithAnsiOnSuite` / `CastWithAnsiOffSuite` (ANSI on/off) and exercise both the interpreted and codegen paths.\n- Extended the existing `null cast` test to cover the four new pairs.\n- Added end-to-end golden coverage in `cast.sql` and regenerated the four `cast.sql.out` golden files.\n- Verified style with scalastyle.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Cursor 2.0 (Claude Opus 4.8)\n\nCloses #56354 from MaxGekk/nanos-cast-micros.\n\nAuthored-by: Maxim Gekk \u003cmax.gekk@gmail.com\u003e\nSigned-off-by: Max Gekk \u003cmax.gekk@gmail.com\u003e\n"
    },
    {
      "commit": "5077f7f12d7660532e9f1e76570b912fabc3b8af",
      "tree": "5f912d4ef0a0250665e92ded3aae2f1ee3c07f55",
      "parents": [
        "a6ac0b8109c02969d685908c37062566653918cc"
      ],
      "author": {
        "name": "YangJie",
        "email": "yangjie01@baidu.com",
        "time": "Mon Jun 08 10:30:24 2026 +0800"
      },
      "committer": {
        "name": "yangjie01",
        "email": "yangjie01@baidu.com",
        "time": "Mon Jun 08 10:30:24 2026 +0800"
      },
      "message": "[SPARK-57255][SQL] Simplify RegExpReplace codegen by extracting the match/replace loop into a shared helper\n\n### What changes were proposed in this pull request?\n\nThis is a sub-task of [SPARK-56908](https://issues.apache.org/jira/browse/SPARK-56908) (reduce generated Java size in whole-stage codegen).\n\n`RegExpReplace` inlined the same match/replace loop in both `nullSafeEval` and `doGenCode` — the matcher build, the `while (find) { appendReplacement }` loop with a `try/catch` that builds `invalidRegexpReplaceError(...)`, and `appendTail`. So every generated whole-stage class that uses `regexp_replace` carried that ~20-line block plus the matcher build and the error-construction constant-pool entries.\n\nThis PR moves it into a shared helper that takes the cached `Pattern` and builds the matcher itself:\n\n```scala\nobject RegExpUtils {\n  def replace(pattern: Pattern, subject: UTF8String, replacement: String,\n              regexp: UTF8String, rep: UTF8String, pos: Int): UTF8String \u003d { ... }\n}\n```\n\nBoth `eval` and codegen call it; `doGenCode` emits a single `RegExpUtils.replace(...)` call (via the Scala-object static forwarder, the same mechanism as the generated `QueryExecutionErrors.xxx` calls). Building the matcher inside the helper means `subject.toString` is computed exactly once per row (the previous inline forms decoded it twice). To give codegen the cached pattern term without building a matcher, `initLastPatternCode` is split out of `initLastMatcherCode` (the latter now composes the former plus the matcher line); the other callers (`RLike` / `RegExpExtract` / `RegExpExtractAll`) are unchanged.\n\nThe Pattern/replacement caching (the per-instance mutable state) stays at the call sites. 
The eval-only cached `result` `StringBuilder` field is dropped — the helper allocates its own, and the codegen path always allocated one per call anyway.\n\n### Why are the changes needed?\n\nTo reduce the size of the generated Java in whole-stage codegen, as tracked by SPARK-56908. Moving the loop + matcher build + error construction into one compiled method removes them from every generated class. Measured with `debugCodegen()` on `spark.range(1000).selectExpr(\"regexp_replace(cast(id as string), \u00270\u0027, \u0027x\u0027)\")`, a single-`regexp_replace` whole-stage stage:\n\n| Metric | Before | After |\n|---|---|---|\n| `maxMethodCodeSize` | 551 | 423 (-23%) |\n| `maxConstantPoolSize` | 277 | 236 (-15%) |\n\nThese are real (not Janino-foldable) reductions, replicated per stage class, with the loop body compiled once.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. This is a refactor; `eval` results, `nullable`, `dataType`, and the error message/args are unchanged, so SQL output and golden files are unaffected.\n\nOne internal trade-off: interpreted `eval` now allocates a `StringBuilder` per row instead of reusing a cached field. This matches what the codegen path always did, and is negligible relative to the regex matching; it removes a `transient lazy val` field.\n\n### How was this patch tested?\n\n- `RegexpExpressionsSuite` \"RegexReplace\" — existing eval + codegen coverage, plus a new throwing-path case (a replacement referencing a non-existent group raises `INVALID_REGEXP_REPLACE`), which the tree previously did not cover; `checkErrorInExpression` exercises it in both interpreted and codegen modes. The full suite (17 tests, incl. 
the SPARK-22570 global-variable-count test and the other `initLastMatcherCode` callers) passes.\n- `StringFunctionsSuite` (`regexp_replace`) and `CollationSQLRegexpSuite`.\n\nFollow-up: `RegExpExtract` / `RegExpExtractAll` share a similar inline-loop shape and could get the same treatment in a separate sub-task.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Claude Opus 4.8)\n\nCloses #56315 from LuciferYang/regexpreplace-codegen-helper.\n\nAuthored-by: YangJie \u003cyangjie01@baidu.com\u003e\nSigned-off-by: yangjie01 \u003cyangjie01@baidu.com\u003e\n"
    },
    {
      "commit": "a6ac0b8109c02969d685908c37062566653918cc",
      "tree": "897c199fc3f9b6fbb29fe10b18b1fb573a3d42ad",
      "parents": [
        "4bc7196c940166e2aec1df3ca205bedaa3291cf6"
      ],
      "author": {
        "name": "Boyang Jerry Peng",
        "email": "jerry.peng@databricks.com",
        "time": "Mon Jun 08 07:13:16 2026 +0900"
      },
      "committer": {
        "name": "Jungtaek Lim",
        "email": "kabhwan.opensource@gmail.com",
        "time": "Mon Jun 08 07:13:16 2026 +0900"
      },
      "message": "[SPARK-57141][SS][RTM][STREAMINGSHUFFLE][PART3] Add StreamingShuffleManager and MultiShuffleManager\n\n### What changes were proposed in this pull request?\n\n  This is **part 3** of a multi-PR effort to add *streaming shuffle* to Spark — a push-based shuffle used by Real-Time Mode (RTM) structured streaming, where writer tasks push records\n  directly to reader tasks over the network instead of writing map output to disk for readers to pull.\n\n  This PR adds the shuffle-manager layer that later PRs plug into:\n\n  - **`StreamingShuffleManager`** — a `ShuffleManager` implementation for streaming shuffle. `getWriter`/`getReader` are intentionally stubbed in this PR (they throw\n  `UnsupportedOperationException`) and are implemented in the push-path / pull-path PRs that follow.\n  - **`MultiShuffleManager`** — routes each shuffle to either the batch `SortShuffleManager` or the `StreamingShuffleManager`, based on a per-query local property, so a single application\n  can mix batch and streaming shuffle.\n  - **`TaskContextAwareLogging`** — a `Logging` mixin that prefixes log lines with queryId / shuffleId / stageId / taskId.\n  - **`SparkEnv`** — exposes the `StreamingShuffleOutputTracker` (added in part 2) to executors, and initializes it **only** when the configured shuffle manager is `StreamingShuffleManager`\n  or `MultiShuffleManager`.\n  - Two streaming-shuffle error conditions (`STREAMING_SHUFFLE_INCORRECT_SEQUENCE_NUMBER`, `STREAMING_SHUFFLE_UNEXPECTED_MESSAGE_TYPE`) and the `STREAMING_QUERY_ID` log key.\n\n  The full PR stack:\n\n  - **Part 1** (SPARK-56674, *merged*) — streaming shuffle wire protocol (Netty messages).\n  - **Part 2** (SPARK-56962, *merged*) — `StreamingShuffleOutputTracker` (driver-side writer-location coordination).\n  - **Part 3** (*this PR*) — shuffle-manager layer (`StreamingShuffleManager` + `MultiShuffleManager`), logging mixin, and SparkEnv tracker wiring.\n  - **Part 4** — `StreamingShuffleWriter` + server-side 
Netty handler (push path).\n  - **Part 5** — `StreamingShuffleReader` + client-side Netty handler (pull path).\n  - **Part 6** — register streaming shuffles with the tracker in `DAGScheduler` (activation).\n  - **Part 7** — end-to-end `StreamingShuffleSuite`.\n  - **Part 8** — documentation.\n\n  ### Why are the changes needed?\n\n  Real-Time Mode / low-latency continuous queries need shuffle data to flow continuously between stages. The default sort shuffle (write map output to disk, then have reducers pull it) adds\n  latency that is unacceptable for these workloads. Streaming shuffle instead pushes records directly from writer tasks to reader tasks.\n\n  This PR lands the manager layer that the writer and reader implementations attach to, plus `MultiShuffleManager` so batch stages keep using the sort shuffle while streaming stages use the\n  streaming shuffle within the same application.\n\n  ### Does this PR introduce _any_ user-facing change?\n\n  No. The new shuffle managers are opt-in via `spark.shuffle.manager` and are not the default; `getWriter`/`getReader` are still stubbed in this PR, so the feature is not yet usable\n  end-to-end (completed in later PRs). The `StreamingShuffleOutputTracker` is initialized only when one of the new managers is configured, so there is no change to the default (sort\n  shuffle) path — this is covered by tests.\n\n  ### How was this patch tested?\n\n  New unit suites:\n\n  - **`StreamingShuffleManagerSuite`** — `getWriterId` for data/termination messages and the unexpected-message-type error; `getQueryId` resolution and failure; `registerShuffle` handle\n  type; and SparkEnv gating (tracker is present for `StreamingShuffleManager`, absent for the default manager).\n  - **`MultiShuffleManagerSuite`** — per-query streaming-vs-batch routing, the enable property, and SparkEnv gating for `MultiShuffleManager`.\n\n  13 tests, all passing. 
`SparkThrowableSuite` validates the two new error conditions.\n\n  ### Was this patch authored or co-authored using generative AI tooling?\n\n  Co-authored with Claude Code (Claude Opus 4.8)\n\nCloses #56196 from jerrypeng/stack/streaming-shuffle-pr3-managers.\n\nAuthored-by: Boyang Jerry Peng \u003cjerry.peng@databricks.com\u003e\nSigned-off-by: Jungtaek Lim \u003ckabhwan.opensource@gmail.com\u003e\n"
    },
    {
      "commit": "4bc7196c940166e2aec1df3ca205bedaa3291cf6",
      "tree": "5c6749c88af0ec1e0143d6e3045ed12e4d019363",
      "parents": [
        "c9e7421168369f2c9704ced12fdc5f644b9f817f"
      ],
      "author": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Sun Jun 07 22:43:10 2026 +0900"
      },
      "committer": {
        "name": "Kousuke Saruta",
        "email": "sarutak@apache.org",
        "time": "Sun Jun 07 22:43:10 2026 +0900"
      },
      "message": "[SPARK-57297][SQL][TESTS] Add a test that SQL execution description respects `spark.sql.redaction.string.regex`\n\n### What changes were proposed in this pull request?\n\nThis PR adds a test to `SQLExecutionSuite` that verifies the SQL execution description (`SparkListenerSQLExecutionStart.description`) is redacted according to `spark.sql.redaction.string.regex`.\n\n### Why are the changes needed?\n\n`SQLExecution` redacts the job description before it is recorded in `SparkListenerSQLExecutionStart`, but there is no test covering this behavior. This test guards the redaction so it is not accidentally dropped.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. This is a test-only change.\n\n### How was this patch tested?\n\nPass the CI with the newly added test.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Opus 4.8)\n\nCloses #56358 from dongjoon-hyun/SPARK-57297.\n\nAuthored-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\nSigned-off-by: Kousuke Saruta \u003csarutak@apache.org\u003e\n"
    },
    {
      "commit": "c9e7421168369f2c9704ced12fdc5f644b9f817f",
      "tree": "8730b2241f79c331ca7ba45b77beff5ee633e94a",
      "parents": [
        "583e5bb0010b52a0201a092985b09b0b2c264a6a"
      ],
      "author": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Sat Jun 06 16:48:18 2026 -0700"
      },
      "committer": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Sat Jun 06 16:48:18 2026 -0700"
      },
      "message": "Revert \"[SPARK-57262][SQL][WEBUI] Job description derived from a query should respect `spark.sql.redaction.string.regex`\"\n\nThis reverts commit 583e5bb0010b52a0201a092985b09b0b2c264a6a.\n"
    },
    {
      "commit": "583e5bb0010b52a0201a092985b09b0b2c264a6a",
      "tree": "17fbc3c9cc6ff8f59686143b491cb51bf9449aae",
      "parents": [
        "b66d392689ffa66cdea7df94a145c3f7068e4a60"
      ],
      "author": {
        "name": "Kousuke Saruta",
        "email": "sarutak@amazon.co.jp",
        "time": "Sun Jun 07 01:41:27 2026 +0900"
      },
      "committer": {
        "name": "Kousuke Saruta",
        "email": "sarutak@apache.org",
        "time": "Sun Jun 07 01:41:27 2026 +0900"
      },
      "message": "[SPARK-57262][SQL][WEBUI] Job description derived from a query should respect `spark.sql.redaction.string.regex`\n\n### What changes were proposed in this pull request?\nThis PR changes `SparkSQLDriver.scala` to redact a query before `setJobDescription`.\n\n### Why are the changes needed?\nIn the current implementation, redaction is done in `SQLExecution.scala` so the description in the table on the top of `/SQL/execution` is redacted.\n\u003cimg width\u003d\"1083\" height\u003d\"349\" alt\u003d\"sql-execution-page-top-table\" src\u003d\"https://github.com/user-attachments/assets/b06fb255-2b46-473d-9046-1b2d578e3bda\" /\u003e\n\nBut the description in the table on the `/jobs` page and the one in the table on the bottom of `/SQL/execution` page are not redacted.\n\u003cimg width\u003d\"525\" height\u003d\"692\" alt\u003d\"jobs-page-before\" src\u003d\"https://github.com/user-attachments/assets/0a5a8ce8-e4be-4669-bd7d-a6c62fe316ca\" /\u003e\n\u003cimg width\u003d\"515\" height\u003d\"274\" alt\u003d\"sql-execution-page-before\" src\u003d\"https://github.com/user-attachments/assets/bd0406cc-5b0b-40a0-96c4-9f9fa1aa048a\" /\u003e\n\n### Does this PR introduce _any_ user-facing change?\nYes.\n\n### How was this patch tested?\nAdded new test.\nAlso confirmed descriptions are redacted in UI.\n```\n$ bin/spark-sql --conf spark.sql.redaction.string.regex\u003d\"secret.*\u003d.*\"\nspark-sql (default)\u003e  CREATE TABLE test1(secret string);\nspark-sql (default)\u003e SELECT * FROM test1 WHERE secret\u003d1;\n```\n\u003cimg width\u003d\"852\" height\u003d\"690\" alt\u003d\"jobs-page-after\" src\u003d\"https://github.com/user-attachments/assets/8e28e37e-369f-479c-9711-999b431756db\" /\u003e\n\u003cimg width\u003d\"598\" height\u003d\"272\" alt\u003d\"sql-execution-page-after\" src\u003d\"https://github.com/user-attachments/assets/cb734556-619b-45c6-a7f6-d52e60132aff\" /\u003e\n\n### Was this patch authored or co-authored using generative AI tooling?\nKiro 
CLI / Claude\n\nCloses #56326 from sarutak/fix-redact-sql-description.\n\nAuthored-by: Kousuke Saruta \u003csarutak@amazon.co.jp\u003e\nSigned-off-by: Kousuke Saruta \u003csarutak@apache.org\u003e\n"
    },
    {
      "commit": "b66d392689ffa66cdea7df94a145c3f7068e4a60",
      "tree": "8730b2241f79c331ca7ba45b77beff5ee633e94a",
      "parents": [
        "c082f824d4e2e837792fdbb2c381fb77a71ccd9f"
      ],
      "author": {
        "name": "Kousuke Saruta",
        "email": "sarutak@amazon.co.jp",
        "time": "Sun Jun 07 01:38:33 2026 +0900"
      },
      "committer": {
        "name": "Kousuke Saruta",
        "email": "sarutak@apache.org",
        "time": "Sun Jun 07 01:38:33 2026 +0900"
      },
      "message": "[SPARK-57284][PYTHON][SQL] Add Scala/Python bindings for vector functions\n\n### What changes were proposed in this pull request?\nThis PR adds following Scala/Python bindings for vector functions which were added in SPARK-55030 (#53924), SPARK-55593 (#54368) and SPARK-55031 (#54011)\n\n### Why are the changes needed?\nFor better built-in function parity.\n\n### Does this PR introduce _any_ user-facing change?\nYes, new built-in functions introduced.\n\n### How was this patch tested?\nNew tests.\n\n### Was this patch authored or co-authored using generative AI tooling?\nKiro CLI / Claude.\n\nCloses #56322 from sarutak/pyspark-vector-functions.\n\nAuthored-by: Kousuke Saruta \u003csarutak@amazon.co.jp\u003e\nSigned-off-by: Kousuke Saruta \u003csarutak@apache.org\u003e\n"
    },
    {
      "commit": "c082f824d4e2e837792fdbb2c381fb77a71ccd9f",
      "tree": "39803beeb9694b1c9f60e3acfd045c3374dec6ed",
      "parents": [
        "ccffd010bb52ec5a1e9d54506bd854522da0234f"
      ],
      "author": {
        "name": "cxzl25",
        "email": "sychen@ctrip.com",
        "time": "Sat Jun 06 00:41:27 2026 -0500"
      },
      "committer": {
        "name": "Mridul Muralidharan",
        "email": "mridulatgmail.com",
        "time": "Sat Jun 06 00:41:27 2026 -0500"
      },
      "message": "[SPARK-56645][CORE] Fix History Server serving stale UI after app completes\n\n### What changes were proposed in this pull request?\nWhen `mergeApplicationListing()` successfully parses a completed event log, it now proactively\ndeletes any existing disk store for the app if the app\u0027s UI is not currently tracked in\n`activeUIs`. Concretely, the following lines are added after the `invalidateUI()` call in\n`doMergeApplicationListingInternal`:\n\n```scala\nif (app.attempts.head.info.completed) {\n  val hasActiveUI \u003d synchronized { activeUIs.contains((appId, attemptId)) }\n  if (!hasActiveUI) {\n    diskManager.foreach(_.release(appId, attemptId, delete \u003d true))\n  }\n}\n```\n\n### Why are the changes needed?\nThere is a race condition between `ApplicationCache`\u0027s LRU eviction and\n`FsHistoryProvider.invalidateUI()` that causes the History Server to serve stale UI data\nafter an in-progress app completes.\n\n### Does this PR introduce _any_ user-facing change?\nYes. After this fix, users who access the History Server UI for an application that completed after its UI was evicted from the `ApplicationCache` will see the fully-completed application data (all jobs, stages, and the final application end event), instead of a stale snapshot from when the UI was last loaded while the app was still in progress.\n\n### How was this patch tested?\nAdd test to FsHistoryProviderSuite\n\n### Was this patch authored or co-authored using generative AI tooling?\nGenerated-by: GitHub Copilot\n\nCloses #55578 from cxzl25/SPARK-56645.\n\nAuthored-by: cxzl25 \u003csychen@ctrip.com\u003e\nSigned-off-by: Mridul Muralidharan \u003cmridul\u003cat\u003egmail.com\u003e\n"
    },
    {
      "commit": "ccffd010bb52ec5a1e9d54506bd854522da0234f",
      "tree": "6932b09b932a719ad776107ecbf2833a8df1d11c",
      "parents": [
        "f3f567775a62f9ca3119fc12de01ab212d25dff7"
      ],
      "author": {
        "name": "Maxim Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Sat Jun 06 06:06:24 2026 +0200"
      },
      "committer": {
        "name": "Uros Bojanic",
        "email": "221401595+uros-b@users.noreply.github.com",
        "time": "Sat Jun 06 06:06:24 2026 +0200"
      },
      "message": "[SPARK-57256][SQL] Cast nanosecond-precision timestamps to string\n\n### What changes were proposed in this pull request?\nImplement casting of the nanosecond-precision timestamp types `TIMESTAMP_NTZ(p)` (`TimestampNTZNanosType`) and `TIMESTAMP_LTZ(p)` (`TimestampLTZNanosType`), `p` in [7, 9], to `STRING`.\n\nCasting is implemented in `ToStringBase` (mixed into `Cast`), so this change also fixes `ToPrettyString` (and therefore `Dataset.show()`) for these types via the shared base.\n\nThe change wires the [SPARK-57162](https://issues.apache.org/jira/browse/SPARK-57162) formatter methods into the existing cast-to-string paths (interpreted and codegen):\n- `TimestampLTZNanosType(p)` -\u003e `TimestampFormatter.formatNanos(v, p)` (renders in the session time zone).\n- `TimestampNTZNanosType(p)` -\u003e `TimestampFormatter.formatWithoutTimeZoneNanos(v, p)` (zone-independent, UTC wall-clock grid).\n\nThe fractional-second precision `p` is taken from the source type; sub-`p` digits are floored and trailing zeros are trimmed, consistent with the microsecond cast path (both use `FractionTimestampFormatter`).\n\n`Cast.needsTimeZone` is extended so that `TimestampLTZNanosType -\u003e StringType` resolves the session time zone (mirroring `TimestampType -\u003e StringType`); the NTZ variant does not need a time zone.\n\n### Why are the changes needed?\nToday `Cast` permits these casts at analysis time (the generic `(_, StringType)` rule), but at runtime the nanosecond types have no dedicated case in `ToStringBase` and fall through to the default `String.valueOf(...)` branch, producing the internal form `TimestampNanosVal(epochMicros, nanosWithinMicro)` instead of a proper SQL timestamp string. 
Producing a correct textual representation is a prerequisite for nanosecond support in expressions, SHOW/pretty output, and downstream text-based sinks.\n\n### Does this PR introduce _any_ user-facing change?\nUser-facing only when `spark.sql.timestampNanosTypes.enabled\u003dtrue`; these types are not available otherwise. Casting to string never fails, so ANSI and non-ANSI modes behave identically.\n\nWith `spark.sql.timestampNanosTypes.enabled\u003dtrue`:\n```sql\nSELECT CAST(ts AS STRING);\n-- TIMESTAMP_NTZ(9) value 2020-01-01 00:00:00.123456789\n--   before: TimestampNanosVal(1577836800000000, 789)\n--   after:  2020-01-01 00:00:00.123456789\n```\n\n### How was this patch tested?\nNew cases in `CastSuiteBase` (run under both ANSI on/off; `checkEvaluation` exercises the interpreted and codegen paths): precision 7/8/9, trailing-zero trimming, `nanosWithinMicro` 0 and 999, LTZ time-zone shift under a non-UTC session zone vs. NTZ remaining unshifted, pre-epoch and year-9999 boundaries, and null input.\n\n### Was this patch authored or co-authored using generative AI tooling?\nGenerated-by: Cursor\n\nCloses #56317 from MaxGekk/cast-nanos-to-string.\n\nAuthored-by: Maxim Gekk \u003cmax.gekk@gmail.com\u003e\nSigned-off-by: Uros Bojanic \u003c221401595+uros-b@users.noreply.github.com\u003e\n"
    },
    {
      "commit": "f3f567775a62f9ca3119fc12de01ab212d25dff7",
      "tree": "9bd6a187cec82c208a05492a1ebad7cdcb58983d",
      "parents": [
        "e113afcfb8812050123bde2acb8630ddf25f051b"
      ],
      "author": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Sat Jun 06 09:39:44 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@foxmail.com",
        "time": "Sat Jun 06 09:39:44 2026 +0800"
      },
      "message": "[SPARK-57247][SQL][CONNECT] Support DataFrame.zip in Spark Connect\n\n### What changes were proposed in this pull request?\n\nThis is the follow-up to #54976 ([SPARK-55886]) which implemented `DataFrame.zip` for the classic path and deferred Spark Connect support. This PR wires up the Connect path end-to-end.\n\n- **Protocol (`relations.proto`)**: adds a `Zip` message with `left` and `right` `Relation` fields (field 48 in the `Relation` oneof). Python stubs regenerated via the `connect-gen-protos` Docker image (buf 1.66.1 + mypy 1.19.1 + mypy-protobuf 3.3.0 + ruff 0.14.8).\n- **Server (`SparkConnectPlanner`)**: adds `transformZip` that directly constructs the unresolved `logical.Zip(left, right)` plan, dispatched via `RelTypeCase.ZIP`. `ResolveZip` then runs during analysis, same as the classic path.\n- **Scala Connect `Dataset`**: replaces the `UnsupportedOperationException` stub with `sparkSession.newDataFrame { builder \u003d\u003e builder.getZipBuilder.setLeft(...).setRight(...) }`, following the `crossJoin`/`buildJoin` pattern.\n- **Python Connect `plan.py`**: adds `class Zip(LogicalPlan)` following the `NearestByJoin` pattern.\n- **Python Connect `dataframe.py`**: replaces the `PySparkNotImplementedError` stub with a `plan.Zip` call; removes the doctest suppression (`del DataFrame.zip.__doc__`) that was added when Connect was unsupported.\n\n### Why are the changes needed?\n\n`DataFrame.zip` was merged (#54976) with Connect deferred. This PR completes the implementation so Connect users can use `zip` on equal footing with the classic path.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes. `DataFrame.zip` now works on the Spark Connect path. 
Previously it raised `PySparkNotImplementedError: [NOT_IMPLEMENTED] zip is not implemented.`\n\n### How was this patch tested?\n\n- `test_parity_zip.py`: runs the full `DataFrameZipTestsMixin` (basic projections, expressions, one-sided base, `withColumn`, chained `withColumn`, longer chains, parent-with-chained-child, `withColumnRenamed`, scalar Python UDF, pandas UDF, and two error cases) against a Connect session.\n- `test_connect_plan.py`: asserts that the proto plan for `left.zip(right)` has the `zip` field set with the expected left/right sources.\n- `PlanGenerationTestSuite`: serializes a `zip` plan to proto and compares against a new golden file (`zip.proto.bin`).\n- `ProtoToParsedPlanTestSuite`: deserializes the proto golden file, runs it through `SparkConnectPlanner` + `Analyzer`, and compares the explained plan against `zip.explain`.\n- `DataFrameSuite` (Connect): end-to-end test that zips two projections over a Connect session and asserts the resulting columns and values.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code\n\nCloses #56300 from zhengruifeng/spark-dev-2-df-zip-connect-dev2.\n\nAuthored-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@foxmail.com\u003e\n"
    },
    {
      "commit": "e113afcfb8812050123bde2acb8630ddf25f051b",
      "tree": "7434a4864564d36bfd1b7c34c60ffe3c67769b00",
      "parents": [
        "060a617c5054aac981444d6dc8dceee18cd9cf02"
      ],
      "author": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Fri Jun 05 16:20:59 2026 -0700"
      },
      "committer": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Fri Jun 05 16:20:59 2026 -0700"
      },
      "message": "[SPARK-57286][BUILD] Add `wildfly-openssl-macosx-aarch64` dependency to support Apple Silicon\n\n### What changes were proposed in this pull request?\n\nFor Apache Spark 5.0.0, this PR aims to add the `wildfly-openssl-macosx-aarch64` artifact as a dependency of the `hadoop-cloud` module to support `Apple Silicon` when `hadoop-cloud` profile is enabled.\n\nThis artifact is available from `wildly-openssl` `2.3.0.Final` and now Apache Spark uses it since SPARK-57283.\n- https://repo1.maven.org/maven2/org/wildfly/openssl/wildfly-openssl-macosx-aarch64/2.3.0.Final/\n- #56347\n\n### Why are the changes needed?\n\nThe main `wildfly-openssl` artifact does not ship the native OpenSSL binding for macOS on Apple Silicon (`macosx-aarch64`). Without this platform-specific artifact, `wildfly-openssl` fails to load the native OpenSSL library on Apple Silicon and falls back to the JSSE implementation. Bundling the `macosx-aarch64` variant enables native OpenSSL acceleration (e.g., for S3A via `hadoop-aws`) on Apple Silicon machines.\n\n**BEFORE (Apple Silicon Mac)**\n```\n$ build/sbt package -Phadoop-cloud\n$ bin/spark-shell -c spark.hadoop.fs.s3a.aws.credentials.provider\u003dsoftware.amazon.awssdk.auth.credentials.ProfileCredentialsProvider -c spark.hadoop.fs.s3a.ssl.channel.mode\u003dopenssl --driver-java-options \"-Dorg.wildfly.openssl.path\u003d/opt/homebrew/opt/openssl3/lib\"\n...\nWelcome to\n      ____              __\n     / __/__  ___ _____/ /__\n    _\\ \\/ _ \\/ _ `/ __/  \u0027_/\n   /___/ .__/\\_,_/_/ /_/\\_\\   version 5.0.0-SNAPSHOT\n      /_/\n\nUsing Scala version 2.13.18 (OpenJDK 64-Bit Server VM, Java 21.0.11)\n...\nscala\u003e spark.read.text(\"s3a://.../README.md\")\n26/06/05 14:41:23 WARN FileSystem: Failed to initialize filesystem s3a://.../README.md: java.io.IOException: java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: openssl.TLS, provider: openssl, class: 
org.wildfly.openssl.OpenSSLContextSPI$OpenSSLTLSContextSpi)\n26/06/05 14:41:23 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: s3a://dongjoon/README.md.\njava.io.IOException: java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: openssl.TLS, provider: openssl, class: org.wildfly.openssl.OpenSSLContextSPI$OpenSSLTLSContextSpi)\n...\n```\n\n**AFTER (Apple Silicon Mac)**\n```\n$ build/sbt package -Phadoop-cloud\n$ bin/spark-shell -c spark.hadoop.fs.s3a.aws.credentials.provider\u003dsoftware.amazon.awssdk.auth.credentials.ProfileCredentialsProvider -c spark.hadoop.fs.s3a.ssl.channel.mode\u003dopenssl --driver-java-options \"-Dorg.wildfly.openssl.path\u003d/opt/homebrew/opt/openssl3/lib\"\n...\nWelcome to\n      ____              __\n     / __/__  ___ _____/ /__\n    _\\ \\/ _ \\/ _ `/ __/  \u0027_/\n   /___/ .__/\\_,_/_/ /_/\\_\\   version 5.0.0-SNAPSHOT\n      /_/\n\nUsing Scala version 2.13.18 (OpenJDK 64-Bit Server VM, Java 21.0.11)\n...\nscala\u003e spark.read.text(\"s3a://.../README.md\")\nval res0: org.apache.spark.sql.DataFrame \u003d [value: string]\n\nscala\u003e\n```\n\n### Does this PR introduce _any_ user-facing change?\n\nNo behavior change. This is just additional small jar file with 23552 bytes.\n\n```\n$ ls -al *jar\n-rw-r--r-- 1 dongjoon  staff  23552 Mar 13 07:06 wildfly-openssl-macosx-aarch64-2.3.0.Final.jar\n```\n\n### How was this patch tested?\n\nPass the CIs.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Opus 4.8\n\nCloses #56349 from dongjoon-hyun/SPARK-57286.\n\nAuthored-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\nSigned-off-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\n"
    },
    {
      "commit": "060a617c5054aac981444d6dc8dceee18cd9cf02",
      "tree": "e4118a51c525346b7251f1c0f975bacc39939a5d",
      "parents": [
        "637803e983456b9c6455c5ac4a55ed6720665ff7"
      ],
      "author": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Fri Jun 05 14:31:10 2026 -0700"
      },
      "committer": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Fri Jun 05 14:31:10 2026 -0700"
      },
      "message": "[SPARK-57283][BUILD] Upgrade `wildfly-openssl` to 2.3.0.Final\n\n### What changes were proposed in this pull request?\n\nThis PR aims to upgrade `wildfly-openssl` to 2.3.0.Final for Apache Spark 5.0.0 (2027)\n\n### Why are the changes needed?\n\n`wildfly-openssl` `2.2.5.Final` was released on 2022-08-03 (about 4 years ago). We had better make it up-to-date for Spark 5.0.0 (2027) in order to bring new improvements and bug fixes:\n- https://github.com/wildfly-security/wildfly-openssl/releases/tag/2.3.0.Final (2026-03-12)\n  - https://github.com/wildfly-security/wildfly-openssl/pull/151\n  - https://github.com/wildfly-security/wildfly-openssl/pull/155\n\n### Does this PR introduce _any_ user-facing change?\n\nNo behavior change.\n\n### How was this patch tested?\n\nPass the CIs.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Claude Opus 4.8)\n\nCloses #56347 from dongjoon-hyun/SPARK-57283.\n\nAuthored-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\nSigned-off-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\n"
    },
    {
      "commit": "637803e983456b9c6455c5ac4a55ed6720665ff7",
      "tree": "73f8a21fd2804445387e6f712dfc1291a07846b5",
      "parents": [
        "042ad7d0c4ac1c4d3e9fdeb48e2695fdeb861135"
      ],
      "author": {
        "name": "Maxim Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Fri Jun 05 21:09:50 2026 +0200"
      },
      "committer": {
        "name": "Max Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Fri Jun 05 21:09:50 2026 +0200"
      },
      "message": "[SPARK-57257][SQL] Support nanosecond-precision timestamps in Hive results\n\n### What changes were proposed in this pull request?\nThis PR modifies `HiveResult` to support the nanosecond-precision timestamp types `TIMESTAMP_LTZ(p)` (`TimestampLTZNanosType`) and `TIMESTAMP_NTZ(p)` (`TimestampNTZNanosType`), `p` in [7, 9]. Two cases are added to `HiveResult.toHiveStringDefault`, mirroring the existing microsecond timestamp cases:\n\n- `(i: Instant, _: TimestampLTZNanosType)` -\u003e rendered in the session time zone.\n- `(l: LocalDateTime, _: TimestampNTZNanosType)` -\u003e rendered zone-independently.\n\nThe external collected values are `Instant` (LTZ) and `LocalDateTime` (NTZ); they are converted to the physical `TimestampNanosVal` at the column precision and formatted with the nanosecond-aware `TimestampFormatter` (`formatNanos` / `formatWithoutTimeZoneNanos`, SPARK-57162), flooring sub-`p` digits and trimming trailing zeros. This is the same rendering used by casting these types to string (SPARK-57256), so Hive output stays consistent.\n\n### Why are the changes needed?\nBefore the change, formatting a nanosecond timestamp column through `HiveResult` (e.g. end-to-end SQL / golden-file tests, spark-sql CLI, Thrift server output) hits the catch-all match and fails with a `MatchError`, analogous to the `TimeType` issue fixed in SPARK-51517:\n\n```\nscala.MatchError\n(2020-01-01T00:00:00.123456789Z, TimestampLTZNanosType(9)) (of class scala.Tuple2)\n```\n\n### Does this PR introduce _any_ user-facing change?\nYes. It fixes the error above. 
After the change, nanosecond timestamp values are rendered as proper strings in Hive results (only reachable when `spark.sql.timestampNanosTypes.enabled\u003dtrue`).\n\n### How was this patch tested?\n- New cases in `HiveResultSuite` covering `TIMESTAMP_LTZ(p)` / `TIMESTAMP_NTZ(p)` for `p` in [7, 9]: precision-driven fraction width, trailing-zero trimming, nanosWithinMicro 0 and 999, LTZ session-zone rendering vs. zone-independent NTZ, and nested (array/map/struct) values.\n- New golden-file end-to-end tests `timestamp-ltz-nanos.sql` and `timestamp-ntz-nanos.sql` (as SPARK-51517 added `time.sql`), disabled in `ThriftServerQueryTestSuite`.\n\n### Was this patch authored or co-authored using generative AI tooling?\nGenerated-by: Cursor 1.7.0\n\nCloses #56320 from MaxGekk/nanos-hiveresult.\n\nAuthored-by: Maxim Gekk \u003cmax.gekk@gmail.com\u003e\nSigned-off-by: Max Gekk \u003cmax.gekk@gmail.com\u003e\n"
    },
    {
      "commit": "042ad7d0c4ac1c4d3e9fdeb48e2695fdeb861135",
      "tree": "ce956a4459aab6f5c8220c075d68a4ce0201f407",
      "parents": [
        "0536814d3e0fee135decac0da3f405f86e75140f"
      ],
      "author": {
        "name": "Chao Sun",
        "email": "chao@openai.com",
        "time": "Fri Jun 05 09:57:03 2026 -0700"
      },
      "committer": {
        "name": "Chao Sun",
        "email": "chao@openai.com",
        "time": "Fri Jun 05 09:57:03 2026 -0700"
      },
      "message": "[SPARK-57176][SQL] Extend nested column pruning through array-returning functions\n\n### Why are the changes needed?\n\n[SPARK-57176](https://issues.apache.org/jira/browse/SPARK-57176) follows [SPARK-57022](https://issues.apache.org/jira/browse/SPARK-57022), which added nested column pruning for `transform` over `array\u003cstruct\u003e` inputs.\n\nArray-returning functions still retain the complete input element struct even when downstream expressions and lambdas only require a subset of nested fields. For example:\n\n```sql\nSELECT filter(friends, friend -\u003e friend.last \u003d \u0027Smith\u0027).first\nFROM contacts\n```\n\nIf `friends` contains `first`, `middle`, and `last`, Spark currently reads all three fields even though the query only requires `first` and `last`.\n\n### What changes were proposed in this PR?\n\n- Merge downstream result-field requirements with lambda requirements for `filter` and comparator-based `array_sort`.\n- Propagate projected element schemas through `reverse`, `shuffle`, `slice`, and `array_compact`.\n- Rewrite bound lambda variable types and nested field ordinals after pruning.\n- Retain the complete element schema when the whole result is used, when a lambda consumes the whole element, or when default `array_sort` natural ordering requires the full struct.\n\nFunctions that inspect full element equality or natural ordering remain out of scope because dropping nested fields could change results.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes. Eligible queries using array-returning functions over arrays of structs can read a narrower input schema. 
Query results and SQL APIs are unchanged.\n\n### How was this patch tested?\n\n- `JAVA_HOME\u003d/opt/homebrew/opt/openjdk17/libexec/openjdk.jdk/Contents/Home PATH\u003d/opt/homebrew/opt/openjdk17/bin:$PATH build/sbt \"catalyst/testOnly org.apache.spark.sql.catalyst.expressions.SchemaPruningSuite\" \"sql/testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetV1SchemaPruningSuite org.apache.spark.sql.execution.datasources.parquet.ParquetV2SchemaPruningSuite org.apache.spark.sql.execution.datasources.orc.OrcV1SchemaPruningSuite org.apache.spark.sql.execution.datasources.orc.OrcV2SchemaPruningSuite -- -z Array\"`\n- `JAVA_HOME\u003d/opt/homebrew/opt/openjdk17/libexec/openjdk.jdk/Contents/Home PATH\u003d/opt/homebrew/opt/openjdk17/bin:$PATH build/sbt catalyst/scalastyle sql/scalastyle`\n- `git diff --check`\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Codex (GPT-5)\n\nCloses #56227 from sunchao/dev/chao/codex/spark-array-returning-function-pruning.\n\nAuthored-by: Chao Sun \u003cchao@openai.com\u003e\nSigned-off-by: Chao Sun \u003cchao@openai.com\u003e\n"
    },
    {
      "commit": "0536814d3e0fee135decac0da3f405f86e75140f",
      "tree": "385d4490075b6af21fb030c5c2625809a4ba5b94",
      "parents": [
        "a32cda3d9632c7ce5966fc8e869449d5665f8df4"
      ],
      "author": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Fri Jun 05 08:53:10 2026 -0700"
      },
      "committer": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Fri Jun 05 08:53:10 2026 -0700"
      },
      "message": "[SPARK-57273][BUILD] Upgrade jackson to 2.21.4\n\n### What changes were proposed in this pull request?\n\nThis PR upgrades `FasterXML` `Jackson` to 2.21.4.\n\n### Why are the changes needed?\n\n- https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.21.4 (2026-05-28)\n  - https://github.com/FasterXML/jackson-core/pull/1611\n  - https://github.com/FasterXML/jackson-databind/pull/5931\n  - https://github.com/FasterXML/jackson-databind/pull/5950\n  - https://github.com/FasterXML/jackson-databind/pull/5951\n  - https://github.com/FasterXML/jackson-databind/pull/5967\n  - https://github.com/FasterXML/jackson-databind/pull/5969\n  - https://github.com/FasterXML/jackson-databind/pull/5971\n  - https://github.com/FasterXML/jackson-databind/pull/5974\n  - https://github.com/FasterXML/jackson-databind/issues/5981\n  - https://github.com/FasterXML/jackson-databind/pull/5988\n  - https://github.com/FasterXML/jackson-databind/issues/5993\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nPass the CIs.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Claude Opus 4.8)\n\nCloses #56338 from dongjoon-hyun/SPARK-57273.\n\nAuthored-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\nSigned-off-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\n"
    },
    {
      "commit": "a32cda3d9632c7ce5966fc8e869449d5665f8df4",
      "tree": "447888bf9faf7556c7a2bc3627f1c73ac2d53e5d",
      "parents": [
        "b2580fc795ba6b9d36f414e28a670501b6d8077f"
      ],
      "author": {
        "name": "Joel Robin P",
        "email": "joelrobin1818@gmail.com",
        "time": "Fri Jun 05 11:27:50 2026 +0200"
      },
      "committer": {
        "name": "Max Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Fri Jun 05 11:27:50 2026 +0200"
      },
      "message": "[SPARK-57260][SQL] Fix variable resolution in REPLACE WHERE clause of INSERT INTO\n\n### What changes were proposed in this pull request?\nThis PR fixes variable resolution in the REPLACE WHERE clause of INSERT INTO statements.\n\nREPLACE WHERE is represented as OverwriteByExpression.deleteExpr during analysis. Previously, this expression was resolved only against the target table output because resolveExpressionByPlanOutput was called without includeLastResort \u003d true.\n\nThis PR enables last-resort resolution for OverwriteByExpression.deleteExpr, allowing SQL variables declared with DECLARE to be resolved in REPLACE WHERE predicates while preserving table-column precedence.\n\n### Why are the changes needed?\n[SPARK-57260](https://issues.apache.org/jira/browse/SPARK-57260) reports that SQL variables can be used in the VALUES clause of INSERT INTO, but not in the REPLACE WHERE clause.\n\nFor example, this previously failed during analysis:\n\n```\nBEGIN\n  DECLARE x INT DEFAULT 1;\n  INSERT INTO table_y\n    REPLACE WHERE y \u003d x\n    VALUES (x);\nEND\n```\nThe predicate y \u003d x could not resolve x as a SQL variable, resulting in an unresolved column/variable error.\n\n### Does this PR introduce any user-facing change?\nYes.\n\nBefore this change, INSERT INTO ... REPLACE WHERE could not resolve SQL variables declared with DECLARE in the REPLACE WHERE predicate and failed during analysis with an unresolved column/variable error.\n\nAfter this change, INSERT INTO ... REPLACE WHERE can resolve SQL variables declared with DECLARE in the REPLACE WHERE predicate.\n\n### How was this patch tested?\nAdded test coverage for variable resolution in INSERT INTO ... 
REPLACE WHERE, including:\n\nsession variables\nSQL scripting local variables\ntable-column precedence over SQL scripting variables\n\n### Was this patch authored or co-authored using generative AI tooling?\nGenerated-by: Cursor GPT-5.5  and Claude Code Opus 4.8\n\nCloses #56321 from joelrobin18/SPARK-57260-fix-replace-where-variable-resolution.\n\nAuthored-by: Joel Robin P \u003cjoelrobin1818@gmail.com\u003e\nSigned-off-by: Max Gekk \u003cmax.gekk@gmail.com\u003e\n"
    },
    {
      "commit": "b2580fc795ba6b9d36f414e28a670501b6d8077f",
      "tree": "608fd88935840273ad479928a309a75b63827ff0",
      "parents": [
        "9e32a26d3782de4ccd4870d89164514db6e91e64"
      ],
      "author": {
        "name": "YangJie",
        "email": "yangjie01@baidu.com",
        "time": "Fri Jun 05 14:24:58 2026 +0800"
      },
      "committer": {
        "name": "yangjie01",
        "email": "yangjie01@baidu.com",
        "time": "Fri Jun 05 14:24:58 2026 +0800"
      },
      "message": "[SPARK-57263][SQL] Support Hive 4.2 metastore\n\n### What changes were proposed in this pull request?\n\nThis PR adds Hive `4.2.0` as a supported metastore client version, following 4.0 (SPARK-45265) and 4.1 (SPARK-53095).\n\n- Add `hive.v4_2` with the `extraDeps` taken from the Hive 4.2 POM. A few datanucleus/jdo deps are actually lower than 4.1 (`datanucleus-api-jdo` 6.0.3 vs 6.0.5, `datanucleus-core` 6.0.10 vs 6.0.11, `javax.jdo` 3.2.0 vs 3.2.1), while Derby is bumped to `10.17.1.0` for Java 21. There is a note in `package.scala` so these don\u0027t get \"fixed\" upward later.\n- `Shim_v4_2` extends `Shim_v4_1`. The shimmed method signatures are unchanged between 4.1 and 4.2, so the body is empty.\n- Hive 4.2 is compiled with `maven.compiler.target\u003d21`, so its jars cannot load on an older JVM. When a 4.2 client is constructed on Java \u003c 21, it now fails with `UNSUPPORTED_HIVE_METASTORE_VERSION_FOR_JAVA` instead of a raw `UnsupportedClassVersionError`. The check lives on the client-construction path rather than in `hiveVersion()`, so config validation still resolves `4.2.0` normally.\n- `HiveClientVersions` includes `4.2` in the test sweep only on Java 21+.\n- Update the supported metastore version range in the docs.\n\n### Why are the changes needed?\n\nHive 4.2.0 is released and supports JDK 21. Users on Java 21 should be able to connect Spark to a Hive 4.2 metastore via `spark.sql.hive.metastore.version\u003d4.2.0`.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes. `4.2.0` is now a valid value for `spark.sql.hive.metastore.version` (Java 21+). 
On an older JVM, setting it fails fast with a clear message.\n\n### How was this patch tested?\n\n- Adding `4.2` to `HiveClientVersions` runs it through `HiveClientSuites`, which loads the client via the isolated classloader (requires network access to download the 4.2 jars).\n- Checked the shimmed methods in `Hive.java` and `IMetaStoreClient.java` against Hive 4.2 source; no differences from 4.1.\n- `build/sbt \u0027core/testOnly *SparkThrowableSuite\u0027` for the new error condition.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #56337 from LuciferYang/worktree-SPARK-hive42-metastore.\n\nAuthored-by: YangJie \u003cyangjie01@baidu.com\u003e\nSigned-off-by: yangjie01 \u003cyangjie01@baidu.com\u003e\n"
    },
    {
      "commit": "9e32a26d3782de4ccd4870d89164514db6e91e64",
      "tree": "f92bcb85c688e64d0ec4ab6048822e27ebe40a64",
      "parents": [
        "49153409945c5ebe2cd8894fb2d1de10bdcf3497"
      ],
      "author": {
        "name": "Maxim Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Fri Jun 05 07:44:40 2026 +0200"
      },
      "committer": {
        "name": "Uros Bojanic",
        "email": "221401595+uros-b@users.noreply.github.com",
        "time": "Fri Jun 05 07:44:40 2026 +0200"
      },
      "message": "[SPARK-57250][SQL] Construct sub-microsecond timestamp typed literals with precision derived from fractional digits\n\n### What changes were proposed in this pull request?\n\nPer the ANSI SQL standard (ISO/IEC 9075-2, Subclause 5.3, Syntax Rule 27), the fractional-seconds precision of a typed timestamp literal is the number of digits in its `\u003cseconds fraction\u003e`. This PR makes the typed timestamp literals `TIMESTAMP \u0027...\u0027`, `TIMESTAMP_NTZ \u0027...\u0027`, and `TIMESTAMP_LTZ \u0027...\u0027` construct nanosecond-capable values, deriving the precision `p` from the fractional-digit count, when the SQL config `spark.sql.timestampNanosTypes.enabled` is enabled:\n\n- 7-9 fractional digits -\u003e `TimestampNTZNanosType(p)` / `TimestampLTZNanosType(p)`\n- `\u003c\u003d` 6 fractional digits -\u003e existing microsecond behavior (unchanged)\n- `\u003e` 9 fractional digits -\u003e new `INVALID_TIMESTAMP_LITERAL_PRECISION` parse error\n\nConcretely:\n- Add `SparkDateTimeUtils.fractionalSecondsDigits(String): Int` to count the digits of the seconds fraction in a timestamp string.\n- Route `AstBuilder.visitTypeConstructor` for `TIMESTAMP`/`TIMESTAMP_NTZ`/`TIMESTAMP_LTZ` to the nanosecond parse helpers when the preview flag is on and the literal carries 7-9 fractional digits, preserving the existing NTZ/LTZ resolution rules for the bare `TIMESTAMP` keyword.\n- Add the `INVALID_TIMESTAMP_LITERAL_PRECISION` error condition and a `QueryParsingErrors.timestampLiteralPrecisionExceedsMaxError` factory for literals with more than 9 fractional-second digits.\n\n### Why are the changes needed?\n\nTo support standard-compliant sub-microsecond (nanosecond) timestamp typed literals as part of the nanosecond timestamp preview umbrella (SPARK-56822). 
The ANSI SQL literal grammar carries no explicit precision argument; the precision is implied by the number of fractional-second digits, so the parser must derive it from the literal text.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes, but only when the preview flag `spark.sql.timestampNanosTypes.enabled` is set to `true`. In that case a typed timestamp literal with 7-9 fractional-second digits now produces a nanosecond-precision value instead of being truncated to microseconds, and a literal with more than 9 fractional digits raises `INVALID_TIMESTAMP_LITERAL_PRECISION`. With the flag off (the default), behavior is unchanged: literals are parsed at microsecond precision as before.\n\n### How was this patch tested?\n\nAdded cases to `ExpressionParserSuite` (\"SPARK-57250: nanosecond timestamp typed literals\") covering:\n- 7/8/9-digit fractional precision for `TIMESTAMP_NTZ`, `TIMESTAMP_LTZ`, and bare `TIMESTAMP`;\n- NTZ/LTZ resolution based on the keyword and the presence of a time-zone part;\n- boundary values (pre-epoch, max year, `nanosWithinMicro` 0 and 999);\n- the 6-digit-stays-microsecond regression;\n- the `\u003e9` fractional-digit error;\n- the flag-off regression (microsecond behavior preserved).\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Cursor (Claude Opus 4.8)\n\nCloses #56306 from MaxGekk/spark-57250-typed-literals.\n\nAuthored-by: Maxim Gekk \u003cmax.gekk@gmail.com\u003e\nSigned-off-by: Uros Bojanic \u003c221401595+uros-b@users.noreply.github.com\u003e\n"
    },
    {
      "commit": "49153409945c5ebe2cd8894fb2d1de10bdcf3497",
      "tree": "5379e8ba1635d8b4ff65e2f0b787abf15e9388a9",
      "parents": [
        "e4462556fafcc3aca42ec06896ac1ba75b52766a"
      ],
      "author": {
        "name": "Maxim Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Fri Jun 05 03:19:46 2026 +0200"
      },
      "committer": {
        "name": "Max Gekk",
        "email": "max.gekk@gmail.com",
        "time": "Fri Jun 05 03:19:46 2026 +0200"
      },
      "message": "[SPARK-57207][SQL] Support nanosecond timestamp types in the Types Framework\n\n### What changes were proposed in this pull request?\n\nThis PR wires `TimestampNTZNanosType(p)` and `TimestampLTZNanosType(p)` (p in [7, 9]) through the Spark SQL Types Framework (SPARK-53504), so that all type-specific behavior for the nanosecond timestamp types is centralized behind `TypeOps` / `TypeApiOps`. The nanos types are now supported **only** through the framework: the scattered legacy dispatch for them is removed.\n\nConcretely:\n- Add `TimestampNanosTypeOps` (catalyst) with `TimestampNTZNanosTypeOps` / `TimestampLTZNanosTypeOps`, registered in `TypeOps.apply()`. Overrides: `getPhysicalType`, `getJavaClass`, `getBoxedJavaClass`, `getRowWriter`, `getDefaultLiteral`, `getJavaLiteral`, `getMutableValue`, `toCatalystImpl`, `toScala`/`toScalaImpl`, `createSerializer`, `createDeserializer`.\n- Add a `getBoxedJavaClass` hook to the `TypeOps` base (the boxed Java class used in codegen). The `createSerializer` / `createDeserializer` hooks already exist on the base trait (used by `TimeTypeOps`); the nanos ops above only override them.\n- Add `TimestampNanosTypeApiOps` (sql/api) with NTZ/LTZ subclasses, registered in `TypeApiOps.apply()`. `getEncoder` returns the SPARK-57033 leaves (`LocalDateTimeNanosEncoder(p)` / `InstantNanosEncoder(p)`), gated by `DataTypeErrors.checkTimestampNanosTypesEnabled()`.\n- Remove the nanos branches from the legacy code paths now handled by the framework: `SerializerBuildHelper`, `DeserializerBuildHelper`, `CatalystTypeConverters`, `EncoderUtils`, `CodeGenerator`, `Literal`, and `InternalRow`. 
In `SerializerBuildHelper` / `DeserializerBuildHelper`, `OptionEncoder` / `TransformingEncoder` are unwrapped before the framework leaf dispatch, since those wrapper encoders proxy `dataType` to the wrapped encoder.\n- Add `MutableTimestampNanos` to `SpecificInternalRow` to avoid the `MutableAny` fallback.\n- Add a `checkValue` on `spark.sql.timestampNanosTypes.enabled` requiring `spark.sql.types.framework.enabled\u003dtrue`, so the types cannot be enabled outside the framework.\n\nFractional-second string formatting is not implemented yet (no `TimestampFormatter` for these types). Until it lands, converting a nanos value to a string (CAST to STRING, EXPLAIN/SHOW output, SQL-literal rendering) raises the new `UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING` error rather than silently truncating to microseconds. Both the interpreted path (`TimestampNanosTypeApiOps.format`) and the codegen path (`ToStringBase.castToStringCode`) raise the identical error, so the two eval modes stay consistent.\n\nOut of scope (follow-ups): string formatting/CAST-to-string, Connect proto, Arrow, PySpark conversion, Parquet/ColumnVector, and physical ordering/compare/hash.\n\n### Why are the changes needed?\n\nThe logical nanosecond timestamp types (SPARK-56876) and the physical row layer (SPARK-56981) already exist, but these types were wired only through scattered legacy dispatch. Centralizing the type-specific operations behind `TypeOps`, consistent with `TimeType`, is a prerequisite for the remaining nanosecond timestamp work and avoids the framework-on/off behavior divergence that the previous per-call-site handling produced.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. The nanosecond timestamp types are a preview feature gated by `spark.sql.timestampNanosTypes.enabled` (and `spark.sql.types.framework.enabled`), both off by default in production. 
When these preview flags are enabled, converting a nanos timestamp to a string raises `UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING` because fractional-second formatting is not implemented yet.\n\n### How was this patch tested?\n\nAdded/updated tests:\n- `TimestampNanosTypeOpsSuite` (catalyst): `TypeOps`/`TypeApiOps` registration; `PhysicalDataType`, default `Literal`, and codegen Java class; `InternalRow`/`SpecificInternalRow` roundtrips incl. the dedicated `MutableTimestampNanos` holder; `getEncoder` returns the SPARK-57033 nanos encoders; `CatalystTypeConverters` `java.time` roundtrip; `format`/`toSQLValue` raise `UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING`; framework-disabled leaves the types unsupported (no legacy fallback); enabling the nanos types requires the framework flag.\n- `TimestampNanosTypeOpsSuite` also covers Option-wrapped nanos encoder roundtrips (Some/None for NTZ and LTZ), verifying wrapper encoders are unwrapped before the framework serde dispatch.\n- `TimestampNanosRowSuite` (catalyst): CAST nanos -\u003e STRING raises the unsupported-feature error in both interpreted and codegen modes; unsafe/generic row roundtrips; literal validation.\n\n```\nbuild/sbt \u0027catalyst/testOnly *TimestampNanosTypeOpsSuite *TimestampNanosRowSuite\u0027\nbuild/sbt \u0027core/testOnly org.apache.spark.SparkThrowableSuite\u0027\n```\n\nAll tests pass. `catalyst` / `sql-api` scalastyle are clean.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Cursor\n\nCloses #56266 from MaxGekk/nanos-types-typeops.\n\nAuthored-by: Maxim Gekk \u003cmax.gekk@gmail.com\u003e\nSigned-off-by: Max Gekk \u003cmax.gekk@gmail.com\u003e\n"
    }
  ],
  "next": "e4462556fafcc3aca42ec06896ac1ba75b52766a"
}
