)]}'
{
  "log": [
    {
      "commit": "2aff1f6e4cef82a9897631866a817e149fc692d2",
      "tree": "7cbf21b8201a72a8644abbedaabf3ee17ef27720",
      "parents": [
        "7d8df99ca8664dfea32f25d92fff9cc39e415ec5"
      ],
      "author": {
        "name": "Johan Lasperas",
        "email": "johan.lasperas@databricks.com",
        "time": "Thu Apr 09 09:54:10 2026 -0700"
      },
      "committer": {
        "name": "Anton Okolnychyi",
        "email": "aokolnychyi@apache.org",
        "time": "Thu Apr 09 09:54:10 2026 -0700"
      },
      "message": "[SPARK-55689] Skip unsupported column changes during schema evolution\n\n### What changes were proposed in this pull request?\nThe initial implementation of schema evolution in MERGE/INSERT is too aggressive when trying to automatically apply schema evolution: any type mismatch between the source data and target table triggers an attempt to change the target data type, even though the table may not support it.\n\nThis change adds a new DSv2 trait `SupportsSchemaEvolution` that lets connectors indicate whether a given column change should be applied or not.\n\n### Why are the changes needed?\nWhen schema evolution is enabled, the following write currently fails if the connector doesn\u0027t support changing the type of `value` from STRING to INT:\n```\nCREATE TABLE t (key INT, value STRING);\nINSERT WITH SCHEMA EVOLUTION INTO t VALUES (1, 1)\n```\nOn the other hand, the write succeeds without schema evolution: a cast from INT to STRING is added, which is valid.\n\n### Does this PR introduce _any_ user-facing change?\nYes, the following query now succeeds instead of trying - and failing - to change the data type of `value` to INT:\n```\nCREATE TABLE t (key INT, value INT);\nINSERT WITH SCHEMA EVOLUTION INTO t VALUES (1, \"1\")\n```\n\n### How was this patch tested?\nAdded tests for type evolution in INSERT and MERGE INTO\n\nCloses #54658 from johanl-db/dsv2-type-evolution.\n\nLead-authored-by: Johan Lasperas \u003cjohan.lasperas@databricks.com\u003e\nCo-authored-by: Anton Okolnychyi \u003caokolnychyi@apache.org\u003e\nSigned-off-by: Anton Okolnychyi \u003caokolnychyi@apache.org\u003e\n"
    },
    {
      "commit": "7d8df99ca8664dfea32f25d92fff9cc39e415ec5",
      "tree": "511f8aa2e09dd82f96c24e278e59cee4a28bdb33",
      "parents": [
        "6423cb0f8e0c8bf6cc7a4bce78eb7a9fa31da153"
      ],
      "author": {
        "name": "Ivan Sadikov",
        "email": "ivan.sadikov@databricks.com",
        "time": "Thu Apr 09 20:33:10 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Thu Apr 09 20:33:10 2026 +0800"
      },
      "message": "[SPARK-56019][SQL] Close JDBC connection on task kill to unblock native socket reads\n\n### What changes were proposed in this pull request?\n\nNative JDBC socket reads (e.g. `socketRead0` in `ResultSet.next()` and `PreparedStatement.executeBatch()`) do not respond to `Thread.interrupt()`. When the task reaper kills a task blocked in either call, the thread never unblocks and lingers indefinitely. This registers a `TaskInterruptListener` on both the read path (`JDBCRDD.compute`, just before `executeQuery()`) and the write path (`JdbcUtils.savePartition`, just after the connection is opened). On kill, the listener closes the `Connection`, which tears down the underlying TCP socket and causes the blocked native call to throw a `SQLException`. The existing `finally` blocks in both paths are updated to tolerate a connection that was already closed by the listener.\n\n### Why are the changes needed?\nTasks blocked in native JDBC reads silently ignore `Thread.interrupt()`, so the task reaper cannot terminate them. The threads pile up, exhaust the executor\u0027s thread pool, and eventually hang the executor. Closing the connection from the interrupt-listener side is the correct fix: JDBC 4.0 §9.6 requires all methods on a closed `Connection` to throw `SQLException`, and major drivers (PostgreSQL, MySQL, H2) implement this reliably. Closes SPARK-56019.\n\n### Does this PR introduce _any_ user-facing change?\nNo.\n\n### How was this patch tested?\nAdded `JdbcTaskInterruptSuite` covering both paths. Tests use mock JDBC objects and `CountDownLatch` to simulate threads blocked after `executeQuery()` (in `ResultSet.next()`) and in `executeBatch()`. Each test fires the interrupt listener while the mock is blocking, asserts `conn.close()` is called, and confirms the task thread unblocks and propagates the `SQLException`. A separate test verifies that a `rollback()` failure on an already-closed connection does not mask the original exception.\n\n### Was this patch authored or co-authored using generative AI tooling?\nNo.\n\nCloses #55268 from sadikovi/SPARK-56019.\n\nAuthored-by: Ivan Sadikov \u003civan.sadikov@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "6423cb0f8e0c8bf6cc7a4bce78eb7a9fa31da153",
      "tree": "8f776be9f8afad0ecda6e879ca4df87a6e181066",
      "parents": [
        "3851cb50e254c8364bd440d0ef0e1fb28966eacf"
      ],
      "author": {
        "name": "Kousuke Saruta",
        "email": "sarutak@amazon.co.jp",
        "time": "Thu Apr 09 19:42:40 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Thu Apr 09 19:42:40 2026 +0900"
      },
      "message": "[SPARK-56407][BUILD][TESTS] Remove pre-built class files and JARs used in artifact transfer tests\n\n### What changes were proposed in this pull request?\nThis PR is a part of SPARK-56352 for artifact transfer test files, replacing pre-built .class files, JAR files, CRC files, and serialized binaries with dynamic generation at test time, removing 20 binary/text files from the repository.\n\nChanges:\n- Update `ArtifactManagerSuite` (sql/core) to dynamically compile Java and Scala source files into .class files and JARs at test time using `createJarWithJavaSources()` and `createJarWithScalaSources()`.\n- Update `ArtifactSuite` (sql/connect/client) to generate test artifacts (smallClassFile.class, smallJar.jar, largeJar.jar, etc.) in a temp directory and compute CRC values dynamically using `java.util.zip.CRC32`.\n- Update `ClassFinderSuite` (sql/connect/client) to generate dummy .class files in a temp directory instead of reading from pre-built resources.\n- Update `AddArtifactsHandlerSuite` (sql/connect/server) to generate test artifacts and CRC files dynamically in a temp directory.\n- Update `SparkConnectClientSuite` (sql/connect/client) to use a dynamically generated temp file instead of referencing the deleted artifact-tests directory.\n- Update `test_artifact.py` (PySpark) to generate test JAR files in a temp directory and compute CRC values dynamically using `zlib.crc32`.\n- Empty `dev/test-jars.txt` and `dev/test-classes.txt` as all listed files are now removed.\n\nNote on test artifacts: The artifact transfer tests (ArtifactSuite, AddArtifactsHandlerSuite, test_artifact.py) only verify the byte-level transfer protocol (chunking, CRC), so the generated .class files and JAR entries contain arbitrary bytes rather than valid Java class files. In contrast, ArtifactManagerSuite requires valid class files for classloader testing and uses `createJarWithJavaSources()`/`createJarWithScalaSources()` accordingly.\n\nFiles removed:\n- `data/artifact-tests/junitLargeJar.jar`\n- `data/artifact-tests/smallJar.jar`\n- `data/artifact-tests/crc/` (3 files: README.md, junitLargeJar.txt, smallJar.txt)\n- `sql/connect/common/src/test/resources/artifact-tests/` (11 files: Hello.class, smallClassFile.class, smallClassFileDup.class, smallJar.jar, junitLargeJar.jar, crc/*.txt)\n- `sql/core/src/test/resources/artifact-tests/` (4 .class files: Hello.class, HelloWithPackage.class, IntSumUdf.class, smallClassFile.class)\n\n### Why are the changes needed?\nAs noted in the PR discussion (https://github.com/apache/spark/pull/50378):\n\u003e the ultimate goal is to refactor the tests to automatically build the jars instead of using pre-built ones\n\nThis PR completes that goal by removing all remaining pre-built test artifacts. After this change, no binary artifacts remain in the source tree for test purposes, and the release-time workaround (SPARK-51318) becomes fully unnecessary.\n\nNote: `dev/test-jars.txt` and `dev/test-classes.txt` are left as empty files because `dev/create-release/release-tag.sh` reads them with `rm $(\u003cdev/test-jars.txt)`, which would fail if the files were deleted. A follow-up PR will update the release script and remove these files.\n\n### Does this PR introduce _any_ user-facing change?\nNo.\n\n### How was this patch tested?\n- ArtifactManagerSuite (sql/core)\n- ArtifactSuite, ClassFinderSuite, SparkConnectClientSuite (sql/connect/client)\n- AddArtifactsHandlerSuite (sql/connect/server) — fails identically on origin/master when run standalone via SBT due to a pre-existing session initialization issue unrelated to this change\n- test_artifact.py (PySpark)\n\n### Was this patch authored or co-authored using generative AI tooling?\nKiro CLI / Opus 4.6\n\nCloses #55272 from sarutak/remove-test-jars-d.\n\nAuthored-by: Kousuke Saruta \u003csarutak@amazon.co.jp\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "3851cb50e254c8364bd440d0ef0e1fb28966eacf",
      "tree": "2b293d26b01bc3e42343738e85edfe800c924c14",
      "parents": [
        "4018cc730f66e258fb7b1a9edeea12b45b8ba29c"
      ],
      "author": {
        "name": "Tengfei Huang",
        "email": "tengfei.huang@databricks.com",
        "time": "Thu Apr 09 15:03:57 2026 +0800"
      },
      "committer": {
        "name": "yangjie01",
        "email": "yangjie01@baidu.com",
        "time": "Thu Apr 09 15:03:57 2026 +0800"
      },
      "message": "[SPARK-56302][CORE] Free task result memory eagerly during serialization on executor\n\n### What changes were proposed in this pull request?\nEagerly null intermediate objects during task result serialization in `Executor` to reduce peak heap memory usage.\n\nDuring result serialization in `TaskRunner.run()`, three representations of the result coexist on the heap simultaneously:\n1. `value` — the raw task result object from `task.run()`\n2. `valueByteBuffer` — first serialization of the result\n3. `serializedDirectResult` — second serialization wrapping the above into a `DirectTaskResult`\n\nEach becomes dead as soon as the next is produced, but none were released.\nThis PR nulls each reference as soon as it\u0027s no longer needed:\n- `value \u003d null` after serializing into `valueByteBuffer`\n- `valueByteBuffer \u003d null` and `directResult \u003d null` after re-serializing into `serializedDirectResult`\n\nAll changes are confined to the executor side within `TaskRunner.run()`, where the variables are local and not exposed to other components.\n\n### Why are the changes needed?\nFor tasks returning large results (e.g. `collect()` on large datasets), the redundant copies can roughly triple peak memory during serialization, increasing GC pressure or causing executor OOM. Eagerly freeing dead references lets the GC reclaim memory sooner.\n\n### Does this PR introduce _any_ user-facing change?\nNo\n\n### How was this patch tested?\nExisting UTs\n\n### Was this patch authored or co-authored using generative AI tooling?\nGenerated-by: Claude Code v2.1.88\n\nCloses #55110 from ivoson/free-result-memory-asap.\n\nLead-authored-by: Tengfei Huang \u003ctengfei.huang@databricks.com\u003e\nCo-authored-by: Tengfei Huang \u003ctengfei.h@gmail.com\u003e\nSigned-off-by: yangjie01 \u003cyangjie01@baidu.com\u003e\n"
    },
    {
      "commit": "4018cc730f66e258fb7b1a9edeea12b45b8ba29c",
      "tree": "4bc7b3cf871c2827bbc9f8e9bc3b43ed616fc3d9",
      "parents": [
        "cf26c421258e2882a44e681cb0cdf3e24b37d825"
      ],
      "author": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Thu Apr 09 09:09:26 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Thu Apr 09 09:09:26 2026 +0800"
      },
      "message": "[SPARK-56338][INFRA][FOLLOWUP] Support MAVEN_MIRROR_URL in SBT launcher bootstrap\n\n### What changes were proposed in this pull request?\n\nThis is a follow-up to #55168. The SBT launcher script (`build/sbt-launch-lib.bash`) did not respect `MAVEN_MIRROR_URL`. This patch adds support for it in two places:\n\n1. **Launcher JAR download**: `MAVEN_MIRROR_URL` is used as the default for `DEFAULT_ARTIFACT_REPOSITORY` when neither is explicitly set.\n2. **SBT boot phase**: When `MAVEN_MIRROR_URL` is set, a temporary SBT repositories config is generated and passed via `-Dsbt.repository.config` so SBT resolves itself and Scala through the mirror.\n\n### Why are the changes needed?\n\nSPARK-56338 added `MAVEN_MIRROR_URL` support for Maven (`pom.xml`) and SBT project builds (`SparkBuild.scala`), but the SBT launcher script was not covered. In environments where default Maven repositories are unreachable, the SBT launcher JAR download and boot phase still fail without manual configuration (e.g. `~/.sbt/repositories`).\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nManually tested by setting `MAVEN_MIRROR_URL` and building Spark with SBT from scratch (launcher JAR removed, no `~/.sbt/repositories`).\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code\n\nCloses #55238 from cloud-fan/spark-56338-sbt-mirror.\n\nAuthored-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "cf26c421258e2882a44e681cb0cdf3e24b37d825",
      "tree": "998c4b4247d283e2cd4e52ae47b6c92976d6a25b",
      "parents": [
        "05f6f69bebca8f685fab54ac3e1f54d794bd75f0"
      ],
      "author": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Wed Apr 08 17:34:02 2026 -0700"
      },
      "committer": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Wed Apr 08 17:34:02 2026 -0700"
      },
      "message": "[SPARK-56397][BUILD] Upgrade `ICU4J` to 78.3\n\n### What changes were proposed in this pull request?\n\nThis PR updates the ICU4J dependency to 78.3.\n\n### Why are the changes needed?\n\nTo keep the dependency up to date with the latest release.\n- https://github.com/unicode-org/icu/releases/tag/release-78.3\n  - https://github.com/unicode-org/icu/pull/3896\n  - https://cldr.unicode.org/downloads/cldr-48#482-changes\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nPass the CIs.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Claude Opus 4.6)\n\nCloses #55264 from dongjoon-hyun/SPARK-56397.\n\nLead-authored-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\nCo-authored-by: dongjoon-hyun \u003cdongjoon-hyun@users.noreply.github.com\u003e\nSigned-off-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\n"
    },
    {
      "commit": "05f6f69bebca8f685fab54ac3e1f54d794bd75f0",
      "tree": "d4cc4c760178fed1403b33583ab4ce591a469990",
      "parents": [
        "528386c424d74ecbb048306a4166f9c793e683ef"
      ],
      "author": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Thu Apr 09 09:24:14 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Thu Apr 09 09:24:14 2026 +0900"
      },
      "message": "[SPARK-56332][SQL][TESTS] Use `sql.SparkSession` in `trait SQLTestData`\n\n### What changes were proposed in this pull request?\nUse `sql.SparkSession` in `trait SQLTestData`\n\n### Why are the changes needed?\nthis is needed for merging SQLTestUtils and QueryTest\n\n### Does this PR introduce _any_ user-facing change?\nNo, test-only\n\n### How was this patch tested?\nCI\n\n### Was this patch authored or co-authored using generative AI tooling?\nCo-authored-by: Claude code (Opus 4.6)\n\nCloses #55162 from zhengruifeng/merge_sql_utils.\n\nAuthored-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "528386c424d74ecbb048306a4166f9c793e683ef",
      "tree": "8d3a4a817a3746815cd1a22ecd18ce3a57f25913",
      "parents": [
        "21ada68e78d593885692a2af34b493846e9d4890"
      ],
      "author": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Thu Apr 09 09:23:06 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Thu Apr 09 09:23:06 2026 +0900"
      },
      "message": "[SPARK-56377][PYTHON] Add type hint for shuffle.py\n\n### What changes were proposed in this pull request?\n\nAdd type hints for shuffle.py\n\n### Why are the changes needed?\n\nEffort to polish pyspark type annotation.\n\n`shuffle.py` is a really old file (10 yo), I\u0027m trying to annotate it without any behavior change. The design of it has some fundamental issues so we have to ignore some type conflicts. Major one would be `ExternalListOfList` claims to be `ExternalList[list[V]]` but `__iter__` returns `V` instead of `list[V]`. I understand the goal behind it but still think it\u0027s a bad design.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nmypy passed locally. 3.10 does not crash. Need to wait for CI.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55241 from gaogaotiantian/shuffle-type-hint.\n\nAuthored-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "21ada68e78d593885692a2af34b493846e9d4890",
      "tree": "7553fa18afb9eef47c51bdd7204d563efc905507",
      "parents": [
        "a7fb5bbe6f86fc761002a8fd7b0d59b55568a4e9"
      ],
      "author": {
        "name": "Yicong-Huang",
        "email": "17627829+Yicong-Huang@users.noreply.github.com",
        "time": "Thu Apr 09 09:17:03 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Thu Apr 09 09:17:03 2026 +0900"
      },
      "message": "[SPARK-56253][PYTHON][CONNECT] Make spark.read.json accept DataFrame input\n\n### What changes were proposed in this pull request?\n\nAllow `spark.read.json()` to accept a DataFrame as input, in addition to file paths and RDDs. The first column of the input DataFrame must be of StringType; additional columns are ignored.\n\n### Why are the changes needed?\n\nParsing in-memory JSON text into a structured DataFrame currently requires `sc.parallelize()`, which is unavailable on Spark Connect. Accepting a DataFrame as input provides a Connect-compatible alternative. This is the inverse of `DataFrame.toJSON()`.\n\nPart of SPARK-55227.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes. `spark.read.json()` now accepts a DataFrame as input. The first column must be StringType; additional columns are ignored.\n\n### How was this patch tested?\n\nNew tests in `test_datasources.py` (classic) and `test_connect_readwriter.py` (Connect).\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo\n\nCloses #55097 from Yicong-Huang/SPARK-56253.\n\nLead-authored-by: Yicong-Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\nCo-authored-by: Yicong Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "a7fb5bbe6f86fc761002a8fd7b0d59b55568a4e9",
      "tree": "3333b9ea2a703522abb72ccc9a3fc4eeeddb109e",
      "parents": [
        "efa725e8772aff881a4c0b1dab9cd503790eae6f"
      ],
      "author": {
        "name": "Ivan Sadikov",
        "email": "ivan.sadikov@databricks.com",
        "time": "Thu Apr 09 08:13:42 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Thu Apr 09 08:13:42 2026 +0900"
      },
      "message": "[SPARK-56330][CORE] Add TaskInterruptListener to TaskContext for interrupt notifications\n\n### What changes were proposed in this pull request?\nAdded `TaskInterruptListener`, a new listener type that fires immediately when a task is interrupted via `markInterrupted`. The new listener follows the same invocation model as `TaskCompletionListener` and `TaskFailureListener` - all listeners run sequentially, exceptions in individual listeners don\u0027t suppress other listeners, and listener exceptions cause task failure.\n\nAlso fixes `TaskKilledException` to pass the cancellation reason to the `RuntimeException` constructor, so `getMessage()` no longer returns `null`.\n\n### Why are the changes needed?\nThe polling-based cancellation API (`killTaskIfInterrupted`) cannot support push-style reactions to task interruption. Components that block on I/O or hold locks have no way to react immediately when cancellation is signalled without implementing their own polling thread. `TaskInterruptListener` closes this gap by providing a callback that fires synchronously when `markInterrupted` is called.\n\n### Does this PR introduce _any_ user-facing change?\nYes. Adds a new `DeveloperApi` listener interface `TaskInterruptListener` and a corresponding `addTaskInterruptListener` method on `TaskContext`. Also fixes `TaskKilledException#getMessage()` to return the actual reason string rather than `null`.\n\n### How was this patch tested?\nAdded a `TaskContextSuite` test that verifies all registered `TaskInterruptListener`s are invoked even when some throw exceptions, and that the interrupt reason is propagated into the resulting `TaskCompletionListenerException` message. Added Java interop coverage in `JavaTaskContextCompileCheck`.\n\n### Was this patch authored or co-authored using generative AI tooling?\nNo.\n\nCloses #55151 from sadikovi/SPARK-56330.\n\nAuthored-by: Ivan Sadikov \u003civan.sadikov@databricks.com\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "efa725e8772aff881a4c0b1dab9cd503790eae6f",
      "tree": "ae76c69ebb063b0b3ae40570cebad3d50f297c03",
      "parents": [
        "33c18eea72ecc4141981bc97a843b3bfbdb8ec1d"
      ],
      "author": {
        "name": "Szehon Ho",
        "email": "szehon.apache@gmail.com",
        "time": "Wed Apr 08 15:29:57 2026 -0700"
      },
      "committer": {
        "name": "Anton Okolnychyi",
        "email": "aokolnychyi@apache.org",
        "time": "Wed Apr 08 15:29:57 2026 -0700"
      },
      "message": "[SPARK-56343][SQL][TESTS] Add MERGE INTO test for type mismatch without schema evolution trigger condition\n\n### What changes were proposed in this pull request?\n\nAdd two tests to `MergeIntoSchemaEvolutionTypeWideningAndExtraFieldTests` for MERGE INTO schema evolution with cross-column assignments (which does not trigger schema evolution), where the source has a type mismatch on a same-named column:\n\n1. **No evolution for compatible cross-column assignment**: `UPDATE SET salary \u003d s.bonus` where `source.salary` is LONG and `target.salary` is INT. Since the assignment uses `s.bonus` (not `s.salary`), the type mismatch on `salary` should be irrelevant and no schema evolution should occur. Asserts both data and schema remain unchanged.\n\n2. **Error for incompatible cross-column assignment**: `UPDATE SET salary \u003d s.bonus` where `s.bonus` is STRING and `target.salary` is INT. This should fail regardless of schema evolution because the explicit assignment has incompatible types.\n\n### Why are the changes needed?\n\nThese tests cover a gap in schema evolution test coverage. The existing tests for cross-column assignments (`salary \u003d s.bonus`) did not include the scenario where the source also has a same-named column (`salary`) with a different type. This is important to verify that schema evolution correctly considers only the actual assignment columns, not unrelated same-named columns with wider types.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nNew tests added and run via:\n```\nbuild/sbt \u0027sql/testOnly *GroupBasedMergeIntoSchemaEvolutionSQLSuite -- -z \"type mismatch on existing column\"\u0027\n```\nAll 4 test cases pass (2 tests x 2 variants: with/without evolution clause).\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Cursor (Claude claude-4.6-opus-high-thinking)\n\nCloses #55173 from szehon-ho/SPARK-56343.\n\nAuthored-by: Szehon Ho \u003cszehon.apache@gmail.com\u003e\nSigned-off-by: Anton Okolnychyi \u003caokolnychyi@apache.org\u003e\n"
    },
    {
      "commit": "33c18eea72ecc4141981bc97a843b3bfbdb8ec1d",
      "tree": "fcd413edf74f55739182020240c0cec86c97240c",
      "parents": [
        "0eac89398ec1d358b075fb0ee154c0ed5f28c3ff"
      ],
      "author": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Wed Apr 08 10:31:31 2026 -0700"
      },
      "committer": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Wed Apr 08 10:31:31 2026 -0700"
      },
      "message": "[SPARK-56393][K8S][DOCS] Drop K8s v1.33 Support\n\n### What changes were proposed in this pull request?\n\nThis PR aims to update K8s docs to recommend K8s v1.34+ at Apache Spark 4.2.0 to utilize more stable features like the following example features at K8s v1.34.\n\n- [Dynamic Resource Allocation (GA)](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)\n\n### Why are the changes needed?\n\n**1. K8s v1.33 will enter the maintenance mode soon (2026-04-28) and will reach the end of support on 2026-06-28**\n- https://kubernetes.io/releases/patch-releases/#1-33\n\n**2. Default K8s Versions in Public Cloud environments**\n\nThe default K8s versions of public cloud providers are already moving to K8s 1.35+ like the following.\n\n- EKS: v1.35 (Default)\n- AKS: v1.35 (Default)\n- GKE: v1.35 (Stable)\n\n**3. End Of Support in Public Cloud environments**\n\nIn addition, K8s 1.33 will reach the end of standard support.\n\n| K8s  |   AKE  |  EKS  |  GKE  |\n| ---- | ------- | ------- | ------- |\n| 1.33 | 2026-06 | 2026-07 | 2026-08 |\n\n- [AKS EOL Schedule](https://docs.microsoft.com/en-us/azure/aks/supported-kubernetes-versions?tabs\u003dazure-cli#aks-kubernetes-release-calendar)\n- [EKS EOL Schedule](https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-release-calendar)\n- [GKE EOL Schedule](https://cloud.google.com/kubernetes-engine/docs/release-schedule)\n\n### Does this PR introduce _any_ user-facing change?\n\nNo, this is a documentation-only change about K8s versions.\n\n### How was this patch tested?\n\nManual review.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55263 from dongjoon-hyun/SPARK-56393.\n\nAuthored-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\nSigned-off-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\n"
    },
    {
      "commit": "0eac89398ec1d358b075fb0ee154c0ed5f28c3ff",
      "tree": "5ed5aad2d55e5fb2355838ad254779a1f7a20312",
      "parents": [
        "a3930d3166570858c21aa0eb7179b8c19b9ca2bf"
      ],
      "author": {
        "name": "Rahul Sharma",
        "email": "rahul.sharma@databricks.com",
        "time": "Thu Apr 09 01:03:29 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Thu Apr 09 01:03:29 2026 +0800"
      },
      "message": "[SPARK-56392][SQL] Make Sample.seed Optional to distinguish user-specified vs random seeds\n\n### What changes were proposed in this pull request?\n\nChange `Sample.seed` from `Long` to `Option[Long]` so that Spark can distinguish between user-specified seeds and system-generated random seeds.\n\n- `Sample.seed` type changed from `Long` to `Option[Long]`. `Some(seed)` means the user explicitly specified a seed (via SQL `REPEATABLE` clause or the programmatic `sample(fraction, seed)` API). `None` means no seed was specified.\n- Added a `Sample` companion object with a backwards-compatible `apply(... seed: Long ...)` overload that wraps the seed in `Some`, so all existing callers continue to compile unchanged.\n- The SQL parser now produces `Some(seed)` when a `REPEATABLE (seed)` clause is present, and `None` otherwise (instead of eagerly generating a random seed).\n- `SampleExec` resolves `None` to a random seed lazily via a new `resolvedSeed` field.\n- `SparkConnectPlanner` passes `Some(seed)` when the proto message has a seed, `None` otherwise.\n- `Dataset.sample(fraction)` and `Dataset.sample(withReplacement, fraction)` (the no-seed overloads) now pass `None` directly instead of generating a random seed upfront.\n- `V2ScanRelationPushDown` resolves `None` to a random seed when pushing down to DSV2 connectors, preserving existing behavior.\n\n### Why are the changes needed?\n\nPreviously, the parser always generated a random seed when no `REPEATABLE` clause was present, making it impossible for downstream components to know whether the seed was explicitly requested by the user. This distinction is important for correctness — for example, `TABLESAMPLE (x PERCENT) REPEATABLE (seed)` relies on deterministic row ordering within partitions, which may require disabling optimizations like out-of-order file processing. Without `Option[Long]`, there is no way to know at the physical plan level whether ordering guarantees are needed.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. The `Sample` companion object provides a backwards-compatible `apply` that accepts `Long`, so all existing code continues to work unchanged. The runtime behavior for both seeded and unseeded samples is preserved.\n\n### How was this patch tested?\n\nUpdated `PlanParserSuite` to verify that:\n- `TABLESAMPLE` without `REPEATABLE` produces `seed \u003d None`\n- `TABLESAMPLE ... REPEATABLE (seed)` produces `seed \u003d Some(seed)`\n\nAdded new test cases for the `REPEATABLE` clause with both percent and bucket sampling.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Claude Opus 4.6)\n\nCloses #55261 from rahulketch/sample-seed-optional.\n\nAuthored-by: Rahul Sharma \u003crahul.sharma@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "a3930d3166570858c21aa0eb7179b8c19b9ca2bf",
      "tree": "4c5d760475b82794985122616ae722b73d02e89d",
      "parents": [
        "361b9d6f16316cbcd247f89ea9889a1e3c69fe89"
      ],
      "author": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Thu Apr 09 01:00:25 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Thu Apr 09 01:00:25 2026 +0800"
      },
      "message": "[SPARK-56221][SQL][PYTHON][FOLLOWUP] Rename TablePartition and remove SHOW CACHED TABLES / listCachedTables\n\n### What changes were proposed in this pull request?\n\nFollow-up to #55025 addressing post-merge review comments:\n\n**1. Rename `CatalogTablePartition` → `TablePartition`**\n\nThe public API class `org.apache.spark.sql.catalog.CatalogTablePartition` has the same name as the internal `org.apache.spark.sql.catalyst.catalog.CatalogTablePartition`, causing confusion and potential import ambiguity.\n\n**2. Remove `SHOW CACHED TABLES` SQL command and `listCachedTables()` catalog API**\n\nNo other database has a `SHOW CACHED TABLES` command, and the programmatic `listCachedTables()` API was designed to complement it. Both are removed, along with the `CachedTable` class, connect proto `ListCachedTables`, and all related infrastructure.\n\nFor SQL users who want to check cache status, a better approach would be to add an `isCached` column to the existing `SHOW TABLES` output, which achieves feature parity with the Scala/Python `isCached()` API.\n\n### Why are the changes needed?\n\n- `CatalogTablePartition` name clashes with an existing internal class.\n- `SHOW CACHED TABLES` is a non-standard SQL command with no precedent in other databases; the programmatic API that complemented it is also unnecessary.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes, within the unreleased master branch (4.2.0):\n- `CatalogTablePartition` is renamed to `TablePartition` (both Scala and Python).\n- `SHOW CACHED TABLES` SQL command is removed.\n- `spark.catalog.listCachedTables()` and the `CachedTable` class are removed.\n\n### How was this patch tested?\n\nRemoved tests for `SHOW CACHED TABLES`, `listCachedTables`, and the parser. Existing tests for `TablePartition` / `listPartitions` remain.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (claude-sonnet-4-6, claude-opus-4-6)\n\nCloses #55139 from cloud-fan/followup.\n\nLead-authored-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\nCo-authored-by: Wenchen Fan \u003ccloud0fan@gmail.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "361b9d6f16316cbcd247f89ea9889a1e3c69fe89",
      "tree": "b7c165877c8d43b2ea129d60ec44c1681845c92e",
      "parents": [
        "426accf105097f6f9510433ee496808c378cfd7c"
      ],
      "author": {
        "name": "Haiyang Sun",
        "email": "haiyang.sun@databricks.com",
        "time": "Wed Apr 08 21:09:43 2026 +0800"
      },
      "committer": {
        "name": "yangjie01",
        "email": "yangjie01@baidu.com",
        "time": "Wed Apr 08 21:09:43 2026 +0800"
      },
      "message": "[SPARK-55278][FOLLOWUP] Add shading rule for udf/worker protobuf as for core and connect\n\n### What changes were proposed in this pull request?\nAdd shading rule for udf/worker protobuf as for core and connect.\n\n### Why are the changes needed?\nUnify protobuf handling as in spark core and spark connect.\n\n### Does this PR introduce _any_ user-facing change?\nNo\n\n### How was this patch tested?\nN/A\n\n### Was this patch authored or co-authored using generative AI tooling?\nYes, AI helped with the build files to be consistent with existing ones.\n\nCloses #55240 from haiyangsun-db/SPARK-55278-shading.\n\nAuthored-by: Haiyang Sun \u003chaiyang.sun@databricks.com\u003e\nSigned-off-by: yangjie01 \u003cyangjie01@baidu.com\u003e\n"
    },
    {
      "commit": "426accf105097f6f9510433ee496808c378cfd7c",
      "tree": "15f817e6c67192078bfe95e6872fc40db699da6d",
      "parents": [
        "a66bf7d35d0a06d44e2947c02b2a0781ce72bcb3"
      ],
      "author": {
        "name": "Felipe Fujiy Pessoto",
        "email": "fepessot@microsoft.com",
        "time": "Wed Apr 08 20:10:22 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Wed Apr 08 20:10:22 2026 +0800"
      },
      "message": "[SPARK-55109][SQL] Enhance RaiseError to generate valid SQL\n\n### What changes were proposed in this pull request?\nFix RaiseError.sql: [[SPARK-55109] RaiseError(xyz).sql is broken in 4.0](https://issues.apache.org/jira/browse/SPARK-55109)\n\nRepro:\n\nimport org.apache.spark.sql.catalyst.expressions.{Literal, RaiseError}\nprintln(RaiseError(Literal(\"error!\")).sql)\n\n3.5 generates valid SQL:\n\nraise_error(\u0027error!\u0027)\n\n4.0:\nraise_error(\u0027USER_RAISED_EXCEPTION\u0027, map(\u0027errorMessage\u0027, \u0027error!\u0027))\n\n### Why are the changes needed?\nTo fix the regression in 4.0\n\n### Does this PR introduce _any_ user-facing change?\nBug fix, now it generates valid SQL.\n\n### How was this patch tested?\nNew unit test.\n\n\u003cimg width\u003d\"785\" height\u003d\"546\" alt\u003d\"image\" src\u003d\"https://github.com/user-attachments/assets/5929d48c-7b11-430d-8e7a-627ad8c50a11\" /\u003e\n\u003cimg width\u003d\"416\" height\u003d\"139\" alt\u003d\"image\" src\u003d\"https://github.com/user-attachments/assets/7f2cf0a3-3ecf-47db-b2c3-abef5b2be408\" /\u003e\n\n### Was this patch authored or co-authored using generative AI tooling?\nGenerated-by: Github Copilot v1.0.14-0\n\nCloses #55115 from felipepessoto/raise_error_sql_fix.\n\nAuthored-by: Felipe Fujiy Pessoto \u003cfepessot@microsoft.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "a66bf7d35d0a06d44e2947c02b2a0781ce72bcb3",
      "tree": "7e84f1232174629e38be2970810e9dd4cbe3dc84",
      "parents": [
        "e96b48b577a8eaeb6b3160fb536281f1b9151423"
      ],
      "author": {
        "name": "anshul_baliga7",
        "email": "baligaanshul@gmail.com",
        "time": "Wed Apr 08 20:06:49 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Wed Apr 08 20:06:49 2026 +0800"
      },
      "message": "[SPARK-56277][SQL] Add missing toString() to NamespaceChange and TableChange property classes\n\n### What changes were proposed in this pull request?\n\nAdd missing `toString()` to `NamespaceChange.SetProperty` and `NamespaceChange.RemoveProperty`, and fix the error message formatting in `unsupportedJDBCNamespaceChangeInCatalogError`.\n\n### Why are the changes needed?\n\nWithout `toString()`, error messages display Java object references instead of readable descriptions when namespace changes fail in JDBC catalogs. [SPARK-55828](https://issues.apache.org/jira/browse/SPARK-55828) fixed the same issue for TableChange inner classes.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes. Error messages now show readable descriptions instead of Java object references.\n\n### How was this patch tested?\n\nFollows the same pattern as [SPARK-55828](https://issues.apache.org/jira/browse/SPARK-55828).\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55085 from anshulbaliga7/add_toString_NamespaceChange.\n\nAuthored-by: anshul_baliga7 \u003cbaligaanshul@gmail.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "e96b48b577a8eaeb6b3160fb536281f1b9151423",
      "tree": "9b7ccea34e4bb88742318d7349e05c9bce434d72",
      "parents": [
        "590b0d5c4883da911f642498e6ea49ad188396eb"
      ],
      "author": {
        "name": "ilicmarkodb",
        "email": "marko.ilic@databricks.com",
        "time": "Wed Apr 08 19:47:42 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Wed Apr 08 19:47:42 2026 +0800"
      },
      "message": "[SPARK-56328][SQL][FOLLOWUP] Handle SubqueryAlias-wrapped inline tables in INSERT VALUES collation fix\n\n### What changes were proposed in this pull request?\n\nThis is a followup to #55160. It fixes a case missed in the original PR where the `VALUES` clause has a table alias (e.g., `INSERT INTO t VALUES (...) AS T(c1)`), which wraps the `UnresolvedInlineTable` in a `SubqueryAlias`.\n\nThe original fix in `ResolveInlineTables` only pattern-matched `InsertIntoStatement` and `OverwriteByExpression` with a direct `UnresolvedInlineTable` child. When a table alias is present, the query child is a `SubqueryAlias` wrapping the `UnresolvedInlineTable`, so the pattern didn\u0027t match and fell through to the generic case without `ignoreCollation`, failing with `INCOMPATIBLE_TYPES_IN_INLINE_TABLE`.\n\nThis PR adds `SubqueryAlias`-wrapped cases for both `InsertIntoStatement` and `OverwriteByExpression`.\n\nNote: the eager evaluation path (parser) was not affected because `isInlineTableInsideInsertValuesClause` runs before `optionalMap` wraps the table in a `SubqueryAlias`.\n\n### Why are the changes needed?\n\nWithout this fix, the following fails when eager evaluation is disabled:\n\n```sql\nCREATE TABLE t (c1 STRING COLLATE UTF8_LCASE);\nINSERT INTO t VALUES (\u0027a\u0027 COLLATE UTF8_LCASE), (\u0027b\u0027 COLLATE UNICODE) AS vals(c1);\n-- AnalysisException: INCOMPATIBLE_TYPES_IN_INLINE_TABLE\n```\n\n### Does this PR introduce _any_ user-facing change?\n\nYes. `INSERT INTO ... VALUES (...) 
AS T(c1)` with conflicting collations now succeeds when the target column has a collation, matching the behavior of the non-aliased form.\n\n### How was this patch tested?\n\nExtended the existing \"INSERT VALUES with explicit conflicting collations\" test in `CollationSuite` to include a second INSERT with a table alias (`AS vals(c1)`), covering both eager and non-eager evaluation paths.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code\n\nCloses #55259 from ilicmarkodb/followup_collation_insert.\n\nAuthored-by: ilicmarkodb \u003cmarko.ilic@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "590b0d5c4883da911f642498e6ea49ad188396eb",
      "tree": "2773e48838cf731028d7a0af31beac1b340f6d90",
      "parents": [
        "163abe513a841e9d77699f8158ccd0054d6211fa"
      ],
      "author": {
        "name": "Szehon Ho",
        "email": "szehon.apache@gmail.com",
        "time": "Wed Apr 08 14:05:07 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Wed Apr 08 14:05:07 2026 +0800"
      },
      "message": "[SPARK-56346][SQL] Use PartitionPredicate in DSV2 Metadata Only Delete\n\n### What changes were proposed in this pull request?\n\nWhen `OptimizeMetadataOnlyDeleteFromTable` fails to translate all delete predicates to standard V2 filters, it now falls back to a second pass that converts partition-column filters to `PartitionPredicate`s (reusing the SPARK-55596 infrastructure), translates any remaining data-column filters to standard V2 predicates, and combines them for `table.canDeleteWhere`. This mirrors the two-pass approach already used for scan filter pushdown in `PushDownUtils.pushPartitionPredicates`.\n\n### Why are the changes needed?\n\nCurrently, `OptimizeMetadataOnlyDeleteFromTable` only attempts to translate all delete predicates to standard V2 filters. If any predicate cannot be translated (e.g. complex expressions on partition columns), the optimization falls back to an expensive row-level delete even though the table could accept the predicates as `PartitionPredicate`s. This change enables the metadata-only delete path in more cases by leveraging the `PartitionPredicate` infrastructure introduced in SPARK-55596.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nNew test suite `DataSourceV2EnhancedDeleteFilterSuite` with 9 test cases covering: first-pass accept, second-pass accept/reject, mixed partition+data filters, UDF on non-contiguous partition columns, multiple PartitionPredicates, and row-level fallback. Existing suites verified for no regressions: `DataSourceV2EnhancedPartitionFilterSuite`, `GroupBasedDeleteFromTableSuite`.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Cursor agent mode)\n\nCloses #55179 from szehon-ho/delete_partition_filter.\n\nAuthored-by: Szehon Ho \u003cszehon.apache@gmail.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "163abe513a841e9d77699f8158ccd0054d6211fa",
      "tree": "3fa2c1c66312d0a01c1e7c55d84716281c03bf39",
      "parents": [
        "aa6154b0e203f876ad31b2a8179297fca1208437"
      ],
      "author": {
        "name": "Xiaoxuan Li",
        "email": "xioxuan@amazon.com",
        "time": "Wed Apr 08 12:31:13 2026 +0900"
      },
      "committer": {
        "name": "Jungtaek Lim",
        "email": "kabhwan.opensource@gmail.com",
        "time": "Wed Apr 08 12:31:13 2026 +0900"
      },
      "message": "[SPARK-56092][SS][CONNECT] Fix NPE in StreamingQueryException.toString() when cause is null\n\n### What changes were proposed in this pull request?\nAdd null safety to StreamingQueryException.toString() to handle cases where cause is null or cause.getMessage is null.\n\nChanges:\n- StreamingQueryException.toString(): use Option to guard against both null cause and null cause.getMessage, falling back to message field.\n- New StreamingQueryExceptionSuite with tests for null cause, normal cause, and cause with null message.\n\n### Why are the changes needed?\nIn Spark Connect, GrpcExceptionConverter constructs StreamingQueryException with params.cause.orNull, which passes null when the original cause is unavailable during gRPC deserialization. Calling toString() on such an instance throws NPE because the original code calls cause.getMessage unconditionally.\n\n### Does this PR introduce _any_ user-facing change?\nNo. Before this fix, calling toString() on a StreamingQueryException with a null cause (e.g. from Spark Connect) would throw an NPE, hiding the actual error message. After this fix, it correctly displays the exception message instead.\n\n### How was this patch tested?\nNew unit tests in StreamingQueryExceptionSuite covering three scenarios:\n- cause is null\n- cause is non-null with a message\n- cause is non-null but getMessage returns null\n\n### Was this patch authored or co-authored using generative AI tooling?\nYes, co-authored with Kiro.\n\nCloses #55044 from xiaoxuandev/fix-56092.\n\nAuthored-by: Xiaoxuan Li \u003cxioxuan@amazon.com\u003e\nSigned-off-by: Jungtaek Lim \u003ckabhwan.opensource@gmail.com\u003e\n"
    },
    {
      "commit": "aa6154b0e203f876ad31b2a8179297fca1208437",
      "tree": "76965df48d6c5b11c181389222f795259d93668b",
      "parents": [
        "c8695495569e4058a758b111ebc942fc4906494c"
      ],
      "author": {
        "name": "Jerry Peng",
        "email": "jerry.peng@databricks.com",
        "time": "Tue Apr 07 20:25:03 2026 -0700"
      },
      "committer": {
        "name": "Liang-Chi Hsieh",
        "email": "viirya@gmail.com",
        "time": "Tue Apr 07 20:25:03 2026 -0700"
      },
      "message": "[SPARK-55306] Add ability to run Kafka tests in Python\n\n### What changes were proposed in this pull request?\n\nThis PR adds infrastructure to enable Kafka integration tests in PySpark using Docker test containers. The changes include:\n\n   1. **KafkaUtils class** (`python/pyspark/sql/tests/streaming/kafka_utils.py`): A utility class that manages a Kafka test cluster via Docker, providing methods to:\n      - Start/stop a single-broker Kafka cluster using testcontainers-python\n      - Create and delete topics\n      - Send messages to Kafka topics\n      - Read records from topics using Spark\n      - Helper methods for testing streaming queries (`assert_eventually`, `wait_for_query_alive`)\n\n   2. **Test infrastructure setup**:\n      - Maven plugin in `connector/kafka-0-10-sql/pom.xml` to generate classpath.txt for dependency management\n      - Helper functions in `python/pyspark/testing/sqlutils.py` to read classpaths from both Maven and SBT builds\n      - Module registration in `dev/sparktestsupport/modules.py` to include Kafka tests in the test suite\n\n   3. **Real-time Mode Test** (`python/pyspark/sql/tests/streaming/test_streaming_kafka_rtm.py`): Demonstrates testing a stateless Kafka-to-Kafka streaming query\n\n   4. **Comprehensive documentation** (`python/pyspark/sql/tests/streaming/KAFKA_TESTING.md`): Complete guide on writing Kafka tests, API reference, testing patterns, and troubleshooting\n\n   5. **Dependency updates**:\n      - Added `testcontainers[kafka]\u003e\u003d3.7.0` and `kafka-python\u003e\u003d2.0.2` to `dev/requirements.txt`\n      - Updated Docker test images for Python 3.11, 3.12, 3.13, and 3.14 to include Kafka test dependencies\n\n### Why are the changes needed?\n\nPreviously, PySpark lacked infrastructure to write integration tests for Kafka streaming functionality. 
This made it difficult to:\n   - Testing RTM (real-time mode) queries in PySpark since RTM Kafka is the only support source currently.\n   - Test Structured Streaming with Python in general as the memory source is also not available in pyspark.\n\n   This infrastructure enables developers to write reliable integration tests without requiring manual Kafka cluster setup, improving test coverage and reliability for PySpark\u0027s Kafka  streaming capabilities.\n\n### Does this PR introduce _any_ user-facing change?\n\n   No. This PR only adds test infrastructure and does not change any user-facing APIs or behavior.\n\n### How was this patch tested?\n\n- Added `test_streaming_stateless` in `test_streaming_kafka_rtm.py` that verifies a Kafka-to-Kafka streaming query\n   - The test demonstrates the full workflow: starting Kafka via Docker, producing messages, running a streaming query, and verifying output\n   - Verified the test infrastructure works with both Maven and SBT builds\n   - Tested that Docker container lifecycle (start/stop) is properly managed\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nPartially by Claude\n\nCloses #53415 from jerrypeng/rtm_python.\n\nLead-authored-by: Jerry Peng \u003cjerry.peng@databricks.com\u003e\nCo-authored-by: Boyang Jerry Peng \u003cjerry.boyang.peng@gmail.com\u003e\nSigned-off-by: Liang-Chi Hsieh \u003cviirya@gmail.com\u003e\n"
    },
    {
      "commit": "c8695495569e4058a758b111ebc942fc4906494c",
      "tree": "48342219575228cb459e6d493e320b905dea2db8",
      "parents": [
        "c7c78e06384484b2323a813d1df8800705a98f6e"
      ],
      "author": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Wed Apr 08 08:44:16 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Wed Apr 08 08:44:16 2026 +0900"
      },
      "message": "[SPARK-56372][INFRA] Add cmake to CI Docker images for R fs package compilation\n\n### What changes were proposed in this pull request?\n\nAdd `cmake` to the `apt-get install` list in CI Docker images (`docs`, `lint`, `sparkr`).\n\n### Why are the changes needed?\n\nThe R `fs` package (a transitive dependency of `devtools`, `testthat`, `rmarkdown`) now bundles `libuv v1.52.0`, which requires `cmake` to build. This causes the \"Base image build\" job to fail with:\n\n```\n/bin/bash: line 2: cmake: command not found\nmake: *** [Makevars:44: libuv] Error 127\nERROR: compilation failed for package \u0027fs\u0027\n```\n\nThe `fs` compilation failure cascades into: `sass` → `bslib` → `shiny` → `rmarkdown` → `devtools` → `testthat`, breaking the entire R package installation.\n\nSee https://github.com/apache/spark/actions/runs/24067715329/job/70197367201\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nCI should pass with the updated Docker images.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nCo-authored-by: Claude code (Opus 4.6)\n\nCloses #55233 from zhengruifeng/fix_imsage.\n\nAuthored-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "c7c78e06384484b2323a813d1df8800705a98f6e",
      "tree": "a2f6b88acac437c6768fa440c079f0017921d6a3",
      "parents": [
        "e8b5cb84e4010513939f8193e7fb54848ac830d4"
      ],
      "author": {
        "name": "DB Tsai",
        "email": "dbtsai@dbtsai.com",
        "time": "Tue Apr 07 16:38:56 2026 -0700"
      },
      "committer": {
        "name": "DB Tsai",
        "email": "dbtsai@dbtsai.com",
        "time": "Tue Apr 07 16:38:56 2026 -0700"
      },
      "message": "[MINOR][CORE][TESTS] Fix flaky DAGSchedulerSuite test for push-based shuffle\n\n### What changes were proposed in this pull request?\n\n  Add `sc.listenerBus.waitUntilEmpty()` before the `completedStage` assertion in the push-based shuffle test in `DAGSchedulerSuite`.\n\n  ### Why are the changes needed?\n\n  The test was intermittently failing due to a race condition. `SparkListenerStageCompleted` events are delivered asynchronously through the\n  listener bus, so the `completedStage` assertion could execute before the event was processed, causing a spurious failure. Waiting for the\n  listener bus to drain before asserting eliminates the race.\n\n  ### Does this PR introduce _any_ user-facing change?\n\n  No.\n\n  ### How was this patch tested?\n\n  Existing test in `DAGSchedulerSuite`. The fix removes a race condition rather than adding new test logic.\n\n  ### Was this patch authored or co-authored using generative AI tooling?\n\n  Generated-by: Claude Sonnet 4.6 (Claude Code)\n\nCloses #55221 from dbtsai/dagscheduler-flaky-fix.\n\nAuthored-by: DB Tsai \u003cdbtsai@dbtsai.com\u003e\nSigned-off-by: DB Tsai \u003cdbtsai@dbtsai.com\u003e\n"
    },
    {
      "commit": "e8b5cb84e4010513939f8193e7fb54848ac830d4",
      "tree": "1df5d7b8c08f0d20f118ca499050c12ad9dd700b",
      "parents": [
        "b490770d049acfcffa409e16917e92aa3db41f3d"
      ],
      "author": {
        "name": "ericm-db",
        "email": "eric.marnadi@databricks.com",
        "time": "Tue Apr 07 11:47:17 2026 -0700"
      },
      "committer": {
        "name": "Anish Shrigondekar",
        "email": "anish.shrigondekar@databricks.com",
        "time": "Tue Apr 07 11:47:17 2026 -0700"
      },
      "message": "[SPARK-56216][SS] Integrate checkpoint V2 with auto-repair snapshot\n\n### What changes were proposed in this pull request?\n\nThis PR adds auto-repair snapshot support to the checkpoint V2 (state store checkpoint IDs) load path. Previously, auto-repair and V2 were completely disjoint: `loadWithCheckpointId` had no recovery logic for corrupt snapshots, and the auto-repair path (`loadSnapshotWithoutCheckpointId`) only handled V1 files without UUID awareness.\n\nChanges:\n- **RocksDBFileManager**: Added `getSnapshotVersionsAndUniqueIdsFromLineage()` which returns all lineage-matching snapshots sorted descending (the V2 equivalent of `getEligibleSnapshotsForVersion`). Refactored `getLatestSnapshotVersionAndUniqueIdFromLineage()` to delegate to it.\n- **RocksDB**: Added `loadSnapshotWithCheckpointId()` which uses `AutoSnapshotLoader` with V2-specific callbacks that map version to uniqueId via a side-channel map. Changelog replay is included inside the load callback so corrupt changelogs also trigger fallback to the next older snapshot.\n- **RocksDB**: Wrapped the snapshot load + changelog replay block in `loadWithCheckpointId()` with a try-catch that delegates to the new auto-repair method when enabled. Uses `getFullLineage()` to build the complete lineage chain (back to version 1) so that version 0 fallback with full changelog replay works correctly.\n\n### Why are the changes needed?\n\nWithout this change, any corrupt or missing snapshot file in V2 mode causes a hard query failure with no recovery path. V1 mode already had auto-repair (falling back to older snapshots and replaying changelogs), but V2\u0027s `loadWithCheckpointId` bypassed that entirely. This is especially important because speculative execution can produce orphaned or incomplete snapshot files that V2 is designed to handle, but corruption of the \"winning\" snapshot had no fallback.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. 
This is an internal improvement to fault tolerance. Queries using checkpoint V2 that previously would fail on corrupt snapshots will now automatically recover when `autoSnapshotRepair.enabled` is true (the production default).\n\n### How was this patch tested?\n\nAdded integration test \"Auto snapshot repair with checkpoint format V2\" in `RocksDBSuite` covering:\n- Single corrupt V2 snapshot: falls back to older snapshot in lineage\n- All V2 snapshots corrupt: falls back to version 0 with full changelog replay\n- Verified state correctness and `numSnapshotsAutoRepaired` metric after repair\n\nAlso verified existing tests pass:\n- `AutoSnapshotLoaderSuite` (5/5)\n- `RocksDBSuite` V1 auto-repair test\n- `RocksDBStateStoreCheckpointFormatV2Suite` (24/24)\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Claude Opus 4.6)\n\nCloses #55015 from ericm-db/integrate-v2-auto-repair-snapshot.\n\nAuthored-by: ericm-db \u003ceric.marnadi@databricks.com\u003e\nSigned-off-by: Anish Shrigondekar \u003canish.shrigondekar@databricks.com\u003e\n"
    },
    {
      "commit": "b490770d049acfcffa409e16917e92aa3db41f3d",
      "tree": "11fb1b0a009abe3b83c213c3e4cfe8e70109c68d",
      "parents": [
        "7737a947d12e92f802cdafc193870b4e03741810"
      ],
      "author": {
        "name": "ilicmarkodb",
        "email": "marko.ilic@databricks.com",
        "time": "Tue Apr 07 22:08:52 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Tue Apr 07 22:08:52 2026 +0800"
      },
      "message": "[SPARK-56328][SQL] Fix inline table collation handling for INSERT VALUES and DEFAULT COLLATION\n\n### What changes were proposed in this pull request?\n\nThis PR fixes two related issues with how collations interact with inline tables (VALUES clauses):\n\n**1. Eager evaluation bypasses DEFAULT COLLATION for CREATE TABLE/VIEW**\n\nInline tables are eagerly evaluated during parsing for performance. But when inside `CREATE TABLE ... DEFAULT COLLATION UTF8_LCASE AS SELECT * FROM VALUES (\u0027a\u0027) AS T(c1)`, the default collation must be applied to string literals during analysis. Since eager evaluation happens before analysis, the collation was lost.\n\nThe fix adds `canEagerlyEvaluateInlineTable` which prevents eager evaluation when the inline table is inside a CREATE TABLE/VIEW statement and contains expressions that need default collation resolution (e.g., string literals or casts to string types).\n\n**2. INSERT INTO VALUES fails with INCOMPATIBLE_TYPES_IN_INLINE_TABLE for collated columns**\n\nWhen using `INSERT INTO ... VALUES` with collated columns, the inline table resolution could fail because values in the same column end up with different collations. This happens when:\n- `ResolveColumnDefaultInCommandInputQuery` resolves `DEFAULT` to a typed null with the target column\u0027s collation, which differs from other literals\u0027 collation\n- Explicit `COLLATE` or `CAST` to collated string types on values produces mismatched collations across rows\n\nThe fix adds an `ignoreCollation` parameter to `EvaluateUnresolvedInlineTable` that strips collations from input types before finding the common type. 
This is safe for INSERT because the INSERT coercion will cast each value to the target column\u0027s type, including collation.\n\nThe collation stripping is applied only when the inline table is the direct VALUES clause of an INSERT statement:\n- **Parser path**: `isInlineTableInsideInsertValuesClause` walks up the parser context tree to `SingleInsertQueryContext` and inspects its `insertInto()` child to detect `INSERT INTO t VALUES (...)`. This is necessary because in the ANTLR grammar (`singleInsertQuery: insertInto query`), the `insertInto` context and the `query` context (containing `InlineTableContext`) are siblings, not ancestor-descendant. Returns false if a `FromClauseContext` is encountered first, meaning the VALUES is inside a subquery (e.g., `INSERT INTO t SELECT * FROM VALUES (...) AS T`).\n- **Analyzer path**: `ResolveInlineTables` pattern-matches `InsertIntoStatement` and `OverwriteByExpression` (from `INSERT REPLACE WHERE`) with a direct `UnresolvedInlineTable` query child.\n\nStandalone `SELECT * FROM VALUES (...)` and CTAS with conflicting explicit collations continue to fail as expected.\n\n### Why are the changes needed?\n\nWithout this fix:\n\n```sql\n-- Fails: DEFAULT COLLATION not applied to inline table literals\nCREATE TABLE t DEFAULT COLLATION UTF8_LCASE AS\n  SELECT * FROM VALUES (\u0027a\u0027), (\u0027b\u0027) AS T(c1) WHERE c1 \u003d \u0027A\u0027;\n-- Column c1 gets UTF8_BINARY instead of UTF8_LCASE\n\n-- Fails: INCOMPATIBLE_TYPES_IN_INLINE_TABLE\nCREATE TABLE t (c1 STRING COLLATE UTF8_LCASE, c2 STRING COLLATE UTF8_LCASE);\nINSERT INTO t VALUES (\u0027a\u0027, DEFAULT), (DEFAULT, DEFAULT);\n\n-- Fails at parse time with eager eval: CAST bakes collation into StringType\nCREATE TABLE t (c1 STRING COLLATE UTF8_LCASE);\nINSERT INTO t VALUES (CAST(\u0027a\u0027 AS STRING COLLATE UTF8_LCASE)), (CAST(\u0027b\u0027 AS STRING COLLATE UNICODE));\n```\n\n### Does this PR introduce _any_ user-facing change?\n\nYes.\n- `CREATE TABLE/VIEW ... 
DEFAULT COLLATION ... AS SELECT * FROM VALUES (...)` now correctly applies the default collation to inline table literals.\n- `INSERT INTO ... VALUES` with collated columns now succeeds in cases that previously failed with `INCOMPATIBLE_TYPES_IN_INLINE_TABLE`.\n\n### How was this patch tested?\n\nNew tests in `CollationSuite` covering both eager and non-eager evaluation paths (`EAGER_EVAL_OF_UNRESOLVED_INLINE_TABLE_ENABLED \u003d true/false`):\n- INSERT VALUES with NULLs, DEFAULT, explicit conflicting collations, mixed collations, nested types (ARRAY)\n- INSERT OVERWRITE with collated values\n- INSERT REPLACE WHERE with collated values (V2 table)\n- INSERT VALUES with CAST to conflicting collated string types (exercises parser-level `ignoreCollation` since CAST bakes collation into StringType at parse time)\n- Collation stripping does not affect expression evaluation\n- Negative tests: standalone SELECT VALUES, INSERT SELECT FROM VALUES, and CTAS with conflicting collations still fail\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code\n\nCloses #55160 from ilicmarkodb/inline_talbe_collation.\n\nAuthored-by: ilicmarkodb \u003cmarko.ilic@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "7737a947d12e92f802cdafc193870b4e03741810",
      "tree": "309be445396dcb2ef122cb806a4d7adf2793f4bc",
      "parents": [
        "98cdaee3e4f9f2f9801cc0bec5483b096ef40db5"
      ],
      "author": {
        "name": "Haiyang Sun",
        "email": "haiyang.sun@databricks.com",
        "time": "Tue Apr 07 17:51:37 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Tue Apr 07 17:51:37 2026 +0800"
      },
      "message": "[SPARK-55278] Introduce module and core abstraction for language-agnostic UDF worker\n\n### What changes were proposed in this pull request?\n\n  This PR introduces the foundational package structure and core abstractions for the language-agnostic UDF worker framework described in [SPIP SPARK-55278](https://issues.apache.org/jira/browse/SPARK-55278).\n\n  The new `udf/worker` module contains two sub-modules:\n\n  - **`proto/`** — Protobuf definition of `UDFWorkerSpecification` (currently a placeholder; full schema to follow), plus a typed Scala wrapper:\n    - `WorkerSpecification` — Scala wrapper around the protobuf spec.\n\n  - **`core/`** — Engine-side APIs (all `Experimental`):\n    - `WorkerDispatcher` — manages workers for a given spec; creates sessions. Handles pooling, reuse, and lifecycle behind the scenes. Extends `AutoCloseable`.\n    - `WorkerSession` — represents one single UDF execution. Not 1-to-1 with a worker process; multiple sessions may share the same underlying worker. Extends `AutoCloseable` with a default no-op `close()` so callers can use try-with-resources\n   from the start.\n    - `WorkerSecurityScope` — identifies a security boundary for worker connection pooling. Requires subclasses to implement `equals`/`hashCode` so that structurally equivalent scopes enable worker reuse.\n\n  Build integration:\n  - Maven and SBT build definitions for both sub-modules.\n  - `project/SparkBuild.scala` updated to register the new modules and configure unidoc exclusions (JavaUnidoc only — Scala API docs are included).\n\n  ### Why are the changes needed?\n\n  This is the first step toward a language-agnostic UDF protocol for Spark that enables UDF workers written in any language to communicate with the Spark engine through a well-defined specification and API boundary. 
The abstractions introduced here establish the core contract that concrete implementations (e.g., process-based or gRPC-based workers) will build on.\n\nWhy introduce a separate root-level module:\n1.\tThe worker specification module is not specific to Spark Connect—it should also support PySpark workers in the classic (non-Connect) mode.\n2.\tThe module has minimal dependency on Spark internals or the SQL engine, making it a poor fit for existing core or sql modules.\n3.\tKeeping it as a separate module helps maintain a clear focus on worker abstractions and improves modularity.\n\n  ### Does this PR introduce _any_ user-facing change?\n\n  No. All new APIs are marked `Experimental` and there are no behavioral changes to existing code.\n\n  ### How was this patch tested?\n\n  - Compilation verified via both Maven and SBT.\n  - `WorkerAbstractionSuite` provides a basic test placeholder.\n  - Scaladoc generation verified via `build/sbt unidoc` (ScalaUnidoc succeeds; JavaUnidoc excludes udf-worker modules, consistent with how `connectCommon`/`connect`/`protobuf` modules are handled).\n\n  ### Was this patch authored or co-authored using generative AI tooling?\n\n  Yes.\n\nCloses #55089 from haiyangsun-db/SPARK-55278.\n\nAuthored-by: Haiyang Sun \u003chaiyang.sun@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "98cdaee3e4f9f2f9801cc0bec5483b096ef40db5",
      "tree": "7c321bb582635cd0313e15ffbc6f79fc89febe34",
      "parents": [
        "9dbe381ed0efda97723f5e81518b8f19f97d00f2"
      ],
      "author": {
        "name": "Jitesh Soni",
        "email": "get2jitesh@gmail.com",
        "time": "Tue Apr 07 15:28:34 2026 +0900"
      },
      "committer": {
        "name": "Jungtaek Lim",
        "email": "kabhwan.opensource@gmail.com",
        "time": "Tue Apr 07 15:28:34 2026 +0900"
      },
      "message": "[SPARK-55450][SS][PYTHON][DOCS] Document admission control in PySpark streaming data sources\n\n### What changes were proposed in this pull request?\n\nThis PR adds documentation and an example for admission control in PySpark custom streaming data sources (SPARK-55304).\n\nChanges include:\n\n1. **Updated tutorial documentation** (`python/docs/source/tutorial/sql/python_data_source.rst`):\n   - Added \"Admission Control for Streaming Readers\" section\n   - Documents `getDefaultReadLimit()` returning `ReadMaxRows(n)` to limit batch size\n   - Shows how `latestOffset(start, limit)` respects the `ReadLimit` parameter\n\n2. **Example file** (`examples/src/main/python/sql/streaming/structured_blockchain_admission_control.py`):\n   - Demonstrates admission control via `getDefaultReadLimit()` and `latestOffset()`\n   - Simulates blockchain data source with controlled batch sizes (20 blocks per batch)\n   - Simple, focused example showing backpressure management\n\n### Why are the changes needed?\n\nUsers need documentation and practical examples to implement admission control in custom streaming sources (introduced in SPARK-55304).\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. Documentation and examples only.\n\n### How was this patch tested?\n\n**Testing approach:**\n- Ran the example on Databricks Dogfood Staging (DBR 17.3 / Spark 4.0)\n- Used the Spark Streaming UI to verify admission control works correctly\n\n**Test notebook:** [pr_54807_admission_control_notebook](https://dogfood.staging.databricks.com/editor/notebooks/1113954931051543?o\u003d6051921418418893#command/7790625346196924)\n\n**What was verified:**\n1. **Batch sizes:** Each micro-batch processed exactly 20 blocks (admission control working)\n2. **Consistent behavior:** 79 batches completed in ~28 seconds, all with 20 rows\n3. 
**Stream reader:** `PythonMicroBatchStreamWithAdmissionControl` active in Streaming UI\n\n**Sample batch output:**\n```json\n{\n  \"batchId\": 78,\n  \"numInputRows\": 20,\n  \"sources\": [{\n    \"description\": \"PythonMicroBatchStreamWithAdmissionControl\",\n    \"startOffset\": {\"block_number\": 1560},\n    \"endOffset\": {\"block_number\": 1580},\n    \"numInputRows\": 20\n  }]\n}\n```\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nYes (Claude Opus 4.5)\n\n🤖 Generated with [Claude Code](https://claude.ai/code)\n\nCloses #54807 from jiteshsoni/SPARK-55450-admission-control-docs.\n\nLead-authored-by: Jitesh Soni \u003cget2jitesh@gmail.com\u003e\nCo-authored-by: Canadian Data Guy \u003cget2jitesh@gmail.com\u003e\nSigned-off-by: Jungtaek Lim \u003ckabhwan.opensource@gmail.com\u003e\n"
    },
    {
      "commit": "9dbe381ed0efda97723f5e81518b8f19f97d00f2",
      "tree": "12fae678c7665c9f65abfbf4388c4ba4b09039f6",
      "parents": [
        "491add877a0956812c31e5ea8e0533bfb8610f1a"
      ],
      "author": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Tue Apr 07 12:55:12 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@foxmail.com",
        "time": "Tue Apr 07 12:55:12 2026 +0800"
      },
      "message": "[SPARK-56340][PYTHON] Move input_type schema to eval conf\n\n### What changes were proposed in this pull request?\n\nUse eval conf to pass the schema json, instead of sending a random string before UDF.\n\n### Why are the changes needed?\n\nClean up JVM \u003c-\u003e python worker protocol. We should not randomly pass data.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\n`test_udf` passed locally, the rest is on CI.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55170 from gaogaotiantian/move-input-type.\n\nAuthored-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@foxmail.com\u003e\n"
    },
    {
      "commit": "491add877a0956812c31e5ea8e0533bfb8610f1a",
      "tree": "efd4547356197e8674bbc78200b0fcd0d7d34846",
      "parents": [
        "af9c8b346673c0ffa79ce3b0a76e53e9df51fe76"
      ],
      "author": {
        "name": "Kousuke Saruta",
        "email": "sarutak@amazon.co.jp",
        "time": "Tue Apr 07 10:28:36 2026 +0900"
      },
      "committer": {
        "name": "Kousuke Saruta",
        "email": "sarutak@apache.org",
        "time": "Tue Apr 07 10:28:36 2026 +0900"
      },
      "message": "[SPARK-56364][BUILD][TESTS] Generate Scala-based test JARs dynamically instead of storing pre-built binaries\n\n### What changes were proposed in this pull request?\nThis PR is a part of SPARK-56352 for Scala-based test JARs, replacing pre-built test JAR files containing Scala classes with dynamic compilation at test time, removing 6 binary JAR files and 1 binary from the repository.\n\nChanges:\n- Add `TestUtils.createJarWithScalaSources()` in `SparkTestUtils.scala` that compiles Scala source files via `scala.tools.nsc.Main` and packages the resulting classes into a JAR, with support for excluding specific classes by prefix.\n- Update test suites to use dynamically generated JARs instead of pre-built ones.\n- Refactor `StubClassLoaderSuite` to use a self-contained dummy class instead of the pre-built `udf_noA.jar` that contained spark-connect classes, eliminating the cross-module dependency noted in the original TODO comment.\n- Extract `StubClassDummyUdfPacker` from `StubClassDummyUdf.scala` into a separate file for use by `UDFClassLoadingE2ESuite`.\n- Remove deleted JAR/binary entries from `dev/test-jars.txt`.\n\nJARs/binaries removed:\n- `core/src/test/resources/TestHelloV2_2.13.jar`\n- `core/src/test/resources/TestHelloV3_2.13.jar`\n- `sql/connect/client/jvm/src/test/resources/TestHelloV2_2.13.jar`\n- `sql/connect/client/jvm/src/test/resources/udf2.13.jar`\n- `sql/connect/client/jvm/src/test/resources/udf2.13` (serialized binary)\n- `sql/core/src/test/resources/artifact-tests/udf_noA.jar`\n- `sql/hive/src/test/resources/regression-test-SPARK-8489/test-2.13.jar`\n\n### Why are the changes needed?\nAs noted in the PR discussion (https://github.com/apache/spark/pull/50378):\n\u003e the ultimate goal is to refactor the tests to automatically build the jars instead of using pre-built ones\n\nThis PR achieves that goal for all Scala-based test JARs. 
By generating JARs dynamically at test time, no binary artifacts need to be stored in the source tree, and the release-time workaround becomes unnecessary for these files.\n\n### Does this PR introduce _any_ user-facing change?\nNo.\n\n### How was this patch tested?\nAll affected test suites pass:\n- ClassLoaderIsolationSuite (core)\n- StubClassLoaderSuite, ArtifactManagerSuite (sql/core)\n- HiveSparkSubmitSuite (sql/hive)\n- ReplE2ESuite (sql/connect)\n- UDFClassLoadingE2ESuite (sql/connect)\n\n### Was this patch authored or co-authored using generative AI tooling?\nKiro CLI / Opus 4.6\n\nCloses #55218 from sarutak/remove-test-jars-c.\n\nAuthored-by: Kousuke Saruta \u003csarutak@amazon.co.jp\u003e\nSigned-off-by: Kousuke Saruta \u003csarutak@apache.org\u003e\n"
    },
    {
      "commit": "af9c8b346673c0ffa79ce3b0a76e53e9df51fe76",
      "tree": "bd5a4854bb59411b4ae035a863ba15ebe3fad143",
      "parents": [
        "975b29964e9f456de3c091af1ef38a0362ffb128"
      ],
      "author": {
        "name": "Yan Yan",
        "email": "yyanyyyy@gmail.com",
        "time": "Tue Apr 07 08:04:23 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Tue Apr 07 08:04:23 2026 +0800"
      },
      "message": "[SPARK-56273][SQL] Simplify extracting fields from DataSourceV2ScanRelation\n\n### What changes were proposed in this pull request?\n\n`DataSourceV2ScanRelation` is a case class with 5 fields. Many pattern match sites only need the `scan` field or a subset of `(relation, scan, output)`, using wildcards for the rest. This couples every match site to the constructor arity, so adding or removing fields requires updating all of them.\n\nThis PR introduces two extractors following the `ExtractV2Table` precedent (SPARK-53720):\n- `ExtractV2Scan`: returns `Scan`\n- `ExtractV2ScanRelation`: returns `(DataSourceV2Relation, Scan, Seq[AttributeReference])`\n\nUpdated 14 pattern match sites across 10 files. Type-based matches and constructor calls are unchanged.\n\n### Why are the changes needed?\nCode simplification. Similar to https://github.com/apache/spark/commit/6bae835ccbc8850ac5e2ab0225c6cd75921f06b4\n\n### Does this PR introduce _any_ user-facing change?\nno\n\n### How was this patch tested?\nThis PR relies on existing tests.\n\n### Was this patch authored or co-authored using generative AI tooling?\nYes - Opus 4.6 \n\nCloses #55070 from yyanyy/extract-v2-scan-relation-extractors.\n\nAuthored-by: Yan Yan \u003cyyanyyyy@gmail.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "975b29964e9f456de3c091af1ef38a0362ffb128",
      "tree": "67a3837711d8b2f20cae3d881e3f280f78187261",
      "parents": [
        "5bb627118c37e9c151cfa35d561e7fd2a7ef4f3d"
      ],
      "author": {
        "name": "Yicong-Huang",
        "email": "17627829+Yicong-Huang@users.noreply.github.com",
        "time": "Tue Apr 07 07:42:23 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Tue Apr 07 07:42:23 2026 +0900"
      },
      "message": "[SPARK-56342][PYTHON] Tighten type hints for refactored eval type functions in worker.py\n\n### What changes were proposed in this pull request?\n\nTighten the type hints for `def func` signatures in `read_udfs()` for eval types that have been refactored to be self-contained. Specifically:\n\n- Replace `Iterator[Any]` with `Iterator[\"GroupedBatch\"]` for grouped eval types (`SQL_GROUPED_AGG_ARROW_UDF`, `SQL_GROUPED_AGG_ARROW_ITER_UDF`, `SQL_WINDOW_AGG_ARROW_UDF`)\n- Rename the `batches` parameter to `data` across all refactored eval types for consistency, since the same call site passes either a flat stream or grouped stream\n- Use `for batch in data` for flat streams and `for group in data` for grouped streams\n- Define `GroupedBatch` and `CoGroupedBatch` type aliases in `pyspark.sql.pandas._typing`\n\n### Why are the changes needed?\n\nAfter refactoring eval types to be self-contained in `read_udfs()`, the `def func` signatures used `Iterator[Any]` for grouped eval types, losing type information. Tightening these hints improves readability and enables better static analysis.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nExisting tests.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55178 from Yicong-Huang/SPARK-56342/tighten-type-hints.\n\nAuthored-by: Yicong-Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "5bb627118c37e9c151cfa35d561e7fd2a7ef4f3d",
      "tree": "54dfc58fa4c809ac63c472f95c3fda1b63842eda",
      "parents": [
        "f184c25b6cd78896bed7fe3cdf2bf038462cf190"
      ],
      "author": {
        "name": "DB Tsai",
        "email": "dbtsai@dbtsai.com",
        "time": "Mon Apr 06 14:22:53 2026 -0700"
      },
      "committer": {
        "name": "Liang-Chi Hsieh",
        "email": "viirya@gmail.com",
        "time": "Mon Apr 06 14:22:53 2026 -0700"
      },
      "message": "[SPARK-56207][SQL] Replace legacy error codes with named errors in DSv2 connector API\n\n### What changes were proposed in this pull request?\n\n  Replaced 18 _LEGACY_ERROR_TEMP_* error codes in the Data Source V2 (DSv2) connector API with properly named, descriptive error conditions. The affected interfaces are all in\n  sql/catalyst/src/main/java/org/apache/spark/sql/connector/:\n\n  - Read: Scan, PartitionReaderFactory, ContinuousPartitionReaderFactory\n  - Write: Write, WriteBuilder, DeltaWrite, DeltaWriteBuilder, LogicalWriteInfo\n  - Catalog: SupportsPartitionManagement, SupportsAtomicPartitionManagement\n  - Util: V2ExpressionSQLBuilder\n\n  New error names follow established Spark conventions (DATASOURCE_*, PARTITION_*, V2_EXPRESSION_SQL_BUILDER_*, UNEXPECTED_*) and are inserted alphabetically into error-conditions.json. All new entries\n  include a sqlState (0A000 for unsupported operations, 42000 for illegal arguments) and messages ending with a period.\n\n### Why are the changes needed?\n  Legacy error codes (_LEGACY_ERROR_TEMP_*) are opaque and hard to search, cross-reference, or document. Named error codes make it easier for users and downstream connectors to identify and handle specific\n  error conditions.\n\n### Does this PR introduce _any_ user-facing change?\n  Yes — error messages now include a sqlState and names are surfaced in error output, but the message text is preserved (with minor normalization: trailing periods added for consistency).\n\n### How was this patch tested?\n  Existing unit and integration tests cover these code paths. No behavioral changes were made.\n\n### Was this patch authored or co-authored using generative AI tooling?\nGenerated-by: Claude Code\n\nCloses #54971 from dbtsai/sqlState.\n\nAuthored-by: DB Tsai \u003cdbtsai@dbtsai.com\u003e\nSigned-off-by: Liang-Chi Hsieh \u003cviirya@gmail.com\u003e\n"
    },
    {
      "commit": "f184c25b6cd78896bed7fe3cdf2bf038462cf190",
      "tree": "bc6e6ad2284c686d7bcadff8ab2e7257f49217b2",
      "parents": [
        "ae7f6e382cf68a6cca631ee613c2dfafaa08f8c6"
      ],
      "author": {
        "name": "Jungtaek Lim",
        "email": "kabhwan.opensource@gmail.com",
        "time": "Tue Apr 07 05:42:31 2026 +0900"
      },
      "committer": {
        "name": "Jungtaek Lim",
        "email": "kabhwan.opensource@gmail.com",
        "time": "Tue Apr 07 05:42:31 2026 +0900"
      },
      "message": "[SPARK-56361][SS] Provide better error with logging on NPE in stream-stream join\n\n### What changes were proposed in this pull request?\n\nThis PR proposes to provide better error with additional logging on NPE in stream-stream join.\n\nWe have captured several places which could throw NPE - this PR provides the better error for users which is less cryptic and gives a quick mitigation if they are willing to tolerate losing some data to keep the query running. For devs, this PR leaves the context to the log, so that when users come to devs with the error, devs can at least know where to start looking into.\n\n### Why are the changes needed?\n\nThrowing NPE does not help anything for users and devs. For users, NPE is mostly cryptic error and they know nothing what to do to mitigate the issue. There is no information around context, so debugging is almost impossible for devs when this happens.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes, users would no longer get NPE from stream-stream join for known places on potential NPE, and will get better exception with the guidance for quick mitigation (toleration of data loss).\n\n### How was this patch tested?\n\nModified test\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude 4.6 Opus\n\nCloses #55214 from HeartSaVioR/SPARK-56361.\n\nAuthored-by: Jungtaek Lim \u003ckabhwan.opensource@gmail.com\u003e\nSigned-off-by: Jungtaek Lim \u003ckabhwan.opensource@gmail.com\u003e\n"
    },
    {
      "commit": "ae7f6e382cf68a6cca631ee613c2dfafaa08f8c6",
      "tree": "26f40796f01d4569976b8e9a9725f89d9c4c69f8",
      "parents": [
        "e42a5617dd629ad042d0a41d89396676d1de09d9"
      ],
      "author": {
        "name": "Jia Teoh",
        "email": "jiateoh@gmail.com",
        "time": "Tue Apr 07 05:11:51 2026 +0900"
      },
      "committer": {
        "name": "Jungtaek Lim",
        "email": "kabhwan.opensource@gmail.com",
        "time": "Tue Apr 07 05:11:51 2026 +0900"
      },
      "message": "[SPARK-56248][PYTHON][SS] Optimize python stateful processor serialization to skip unnecessary list/dict/row construction\n\n### What changes were proposed in this pull request?\n\nEliminate per-call Row(**dict(zip(...))) construction in _serialize_to_bytes, pass normalized tuples directly to schema.toInternal which handles them by index\n\nTo better explain the removal of `Row` usage:\n- The positional ordering is retained because Row was constructed purely positionally:\n    - Row.new stores values in insertion order ([link](https://github.com/jiateoh/spark/blob/615af5d154d1b15b26dfde3cc3441550eac5a615/python/pyspark/sql/types.py#L3547)). [This is also noted in the change notes since 3.0.0](https://github.com/jiateoh/spark/blob/615af5d154d1b15b26dfde3cc3441550eac5a615/python/pyspark/sql/types.py#L3492)\n    - `dict(zip(...))` preserves insertion order in python 3.7+ (dicts preserve insertion order)\n    - Inputs to `zip` are field_names, assumed to be same-ordered as the `converted` input data which is derived from the original input tuple in the same order.\n- `Row` is a tuple subclass, so it always hit [Schema.toInternal](https://github.com/jiateoh/spark/blob/615af5d154d1b15b26dfde3cc3441550eac5a615/python/pyspark/sql/types.py#L1763)\u0027s tuple/list positional branch.  Replacing it with the input tuple or a converted list will execute the same Schema.toInternal branch as before.\n\nThe result is that the extra list, zip, dict, and Row construction is no longer necessary: the end result remains equivalent to the input to the entire `Row(**dict(zip(...)))` sequence: a positionally-ordered tuple/list\n\nAI-assisted + human reviewed/updated\n\n### Why are the changes needed?\n\nThis is a code cleanup/performance optimization. 
Original code has unnecessary operations that are executed for every row, including: rebuilding closures, extracting field names, building intermediate lists + dicts, and constructing Row objects (which sort by field unnecessarily). These can all add minor overhead while having no effect on the underlying usage.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo\n\n### How was this patch tested?\n\nUnit tests\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Claude Opus 4.6)\n\nCloses #55039 from jiateoh/tws_python_serialization_improvements.\n\nAuthored-by: Jia Teoh \u003cjiateoh@gmail.com\u003e\nSigned-off-by: Jungtaek Lim \u003ckabhwan.opensource@gmail.com\u003e\n"
    },
    {
      "commit": "e42a5617dd629ad042d0a41d89396676d1de09d9",
      "tree": "9e1bdc2dc7b8ce4b1c473cb7c0426d960227b11d",
      "parents": [
        "5beaa5b4a1a8c7b2d6067df65af1014de4791b62"
      ],
      "author": {
        "name": "Yicong-Huang",
        "email": "17627829+Yicong-Huang@users.noreply.github.com",
        "time": "Mon Apr 06 18:33:52 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Mon Apr 06 18:33:52 2026 +0900"
      },
      "message": "[SPARK-56359][PYTHON] Remove unused ArrowStreamArrowUDFSerializer\n\n### What changes were proposed in this pull request?\n\nRemove `ArrowStreamArrowUDFSerializer` from `serializers.py`. This class is no longer used after:\n- SPARK-55577 refactored `SQL_SCALAR_ARROW_ITER_UDF` to use `ArrowStreamSerializer` directly\n- SPARK-56348 deleted `ArrowBatchUDFSerializer` (child)\n- SPARK-56349 deleted `ArrowStreamAggArrowUDFSerializer` (child)\n\n### Why are the changes needed?\n\nDead code cleanup. Part of SPARK-55384 (Refactor PySpark Serializers).\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nExisting tests. No behavior change.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo\n\nCloses #55212 from Yicong-Huang/SPARK-56359/cleanup/arrow-stream-arrow-udf-serializer.\n\nAuthored-by: Yicong-Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "5beaa5b4a1a8c7b2d6067df65af1014de4791b62",
      "tree": "e7881c3d97d1f335cbafc200a0d68033412629d5",
      "parents": [
        "732f30bc41f47cde01c8e20dd8793fda3091c49b"
      ],
      "author": {
        "name": "Kousuke Saruta",
        "email": "sarutak@amazon.co.jp",
        "time": "Mon Apr 06 16:50:54 2026 +0800"
      },
      "committer": {
        "name": "Cheng Pan",
        "email": "chengpan@apache.org",
        "time": "Mon Apr 06 16:50:54 2026 +0800"
      },
      "message": "[SPARK-56353][BUILD][TESTS] Generate Java-based test JARs dynamically instead of storing pre-built binaries\n\n### What changes were proposed in this pull request?\nThis PR is a part of SPARK-56352 for Java-based test JARs, replacing pre-built test JAR files containing Java classes with dynamic compilation at test time, removing 7 binary JAR files from the repository.\n\nChanges:\n- Add `TestUtils.createJarWithJavaSources()` in `SparkTestUtils.scala` that compiles Java source code with `javac` and packages the resulting classes into a JAR.\n- Externalize Java source files for Hive UDFs/UDAFs/UDTFs to `src/test/resources/` (9 files), loaded at test time via  `getContextClassLoader.getResource()`. These source files were reverse-engineered from the class files in the pre-built JARs, as no original source code existed in the repository.\n- Update test suites to use dynamically generated JARs instead of pre-built ones.\n- Remove deleted JAR entries from `dev/test-jars.txt`.\n\nJARs removed:\n- `core/src/test/resources/TestUDTF.jar`\n- `sql/core/src/test/resources/SPARK-33084.jar`\n- `sql/hive-thriftserver/src/test/resources/TestUDTF.jar`\n- `sql/hive/src/test/noclasspath/hive-test-udfs.jar`\n- `sql/hive/src/test/resources/SPARK-21101-1.0.jar`\n- `sql/hive/src/test/resources/TestUDTF.jar`\n- `sql/hive/src/test/resources/data/files/TestSerDe.jar`\n\n### Why are the changes needed?\nAs noted in the PR discussion [here](https://github.com/apache/spark/pull/50378#pullrequestreview-2715684464):\n\u003e the ultimate goal is to refactor the tests to automatically build the jars instead of using pre-built ones\n\nThis PR achieves that goal for all Java-based test JARs. 
By generating JARs dynamically at test time, no binary artifacts need to be stored in the source tree, and the release-time workaround becomes unnecessary for these files.\n\n### Does this PR introduce _any_ user-facing change?\nNo.\n\n### How was this patch tested?\nAll affected test suites pass:\n- SparkContextSuite, SparkSubmitSuite, TaskSetManagerSuite (core)\n- SQLQuerySuite (sql/core)\n- HiveUDFDynamicLoadSuite, HiveDDLSuite, HiveQuerySuite, HiveUDFSuite, SQLQuerySuite (sql/hive)\n- CliSuite, HiveThriftServer2Suites (sql/hive-thriftserver)\n\nNo tests were added or removed. Existing tests now compile Java sources at\nruntime instead of loading pre-built JARs.\n\n### Was this patch authored or co-authored using generative AI tooling?\nKiro CLI / Opus 4.6\n\nCloses #55192 from sarutak/remove-test-jars-ab.\n\nAuthored-by: Kousuke Saruta \u003csarutak@amazon.co.jp\u003e\nSigned-off-by: Cheng Pan \u003cchengpan@apache.org\u003e\n"
    },
    {
      "commit": "732f30bc41f47cde01c8e20dd8793fda3091c49b",
      "tree": "b1c9db24ac7db5fa8c7d4813c711fb2077c37eec",
      "parents": [
        "08e5436c7dd84d6150b53f5b6ddbf169ca47035c"
      ],
      "author": {
        "name": "yangjie01",
        "email": "yangjie01@baidu.com",
        "time": "Mon Apr 06 16:49:36 2026 +0800"
      },
      "committer": {
        "name": "Cheng Pan",
        "email": "chengpan@apache.org",
        "time": "Mon Apr 06 16:49:36 2026 +0800"
      },
      "message": "[SPARK-56357][BUILD] Upgrade sbt to 1.12.8\n\n### What changes were proposed in this pull request?\nThis pr aims to upgrade sbt from 1.12.4 to 1.12.8\n\n### Why are the changes needed?\nRelease note:\n- https://github.com/sbt/sbt/releases/tag/v1.12.5\n- https://github.com/sbt/sbt/releases/tag/v1.12.6\n- https://github.com/sbt/sbt/releases/tag/v1.12.7\n- https://github.com/sbt/sbt/releases/tag/v1.12.8\n\n### Does this PR introduce _any_ user-facing change?\nNo\n\n### How was this patch tested?\n- Pass Github Actions\n\n### Was this patch authored or co-authored using generative AI tooling?\nNo\n\nCloses #55194 from LuciferYang/sbt-1.12.8.\n\nAuthored-by: yangjie01 \u003cyangjie01@baidu.com\u003e\nSigned-off-by: Cheng Pan \u003cchengpan@apache.org\u003e\n"
    },
    {
      "commit": "08e5436c7dd84d6150b53f5b6ddbf169ca47035c",
      "tree": "119c906c9c5f8626f141b471fdb78ddd476b1fd9",
      "parents": [
        "3fb32fd8a7cfccb86ebad789716637c11fd21700"
      ],
      "author": {
        "name": "Cheng Pan",
        "email": "chengpan@apache.org",
        "time": "Mon Apr 06 16:48:05 2026 +0800"
      },
      "committer": {
        "name": "Cheng Pan",
        "email": "chengpan@apache.org",
        "time": "Mon Apr 06 16:48:05 2026 +0800"
      },
      "message": "[SPARK-55657][BUILD] Bump Hadoop 3.5.0\n\n### What changes were proposed in this pull request?\n\nHadoop 3.5 now requires Java 17, and should also work with Java 21 and 25.\n\nSome dependency changes, all about the cloud OSS (Object Storage Service) connectors.\n\n- Hadoop official `hadoop-gcp` replaces the Google-maintained `gcs-connector`, see details at HADOOP-19343.\n- HADOOP-19778 removes deprecated WASB code from `hadoop-azure`, also cuts out `jetty-util` and `jetty-util-ajax`\n- `hadoop-azure` and `hadoop-aliyun` pull some new dependencies, but this won\u0027t go official spark binary tgz since it is built without `-Phadoop-cloud`\n- `hadoop-tos` is a new cloud vendor connector, which is excluded, similar to `hadoop-cos`\n\nLICENSE/NOTICE are not changed because the official Hadoop binary release is built without `-Phadoop-cloud`, thus will not be affected.\n\n### Why are the changes needed?\n\nHadoop 3.5.0 [Overview](https://hadoop.apache.org/docs/r3.5.0/index.html) and [Release notes](http://hadoop.apache.org/docs/r3.5.0/hadoop-project-dist/hadoop-common/release/3.5.0/RELEASENOTES.3.5.0.html)\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nPass GHA.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo\n\nCloses #54448 from pan3793/SPARK-55657.\n\nLead-authored-by: Cheng Pan \u003cchengpan@apache.org\u003e\nCo-authored-by: Cheng Pan \u003cpan3793@gmail.com\u003e\nSigned-off-by: Cheng Pan \u003cchengpan@apache.org\u003e\n"
    },
    {
      "commit": "3fb32fd8a7cfccb86ebad789716637c11fd21700",
      "tree": "82e19dc074dec6d91e37f2d9afe5ce70e5d9e699",
      "parents": [
        "7167dade35c33fae2f30a911e2b2a2440c9757c1"
      ],
      "author": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Mon Apr 06 16:02:47 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Mon Apr 06 16:02:47 2026 +0900"
      },
      "message": "[SPARK-56363][INFRA] Add remotes in Spark release image\n\n### What changes were proposed in this pull request?\n\nSee https://github.com/apache/spark/pull/54872 and https://github.com/apache/spark/pull/54853\n\n### Why are the changes needed?\n\nSee https://github.com/apache/spark/pull/54872 and https://github.com/apache/spark/pull/54853\n\n### Does this PR introduce _any_ user-facing change?\n\nNo, dev-only.\n\n### How was this patch tested?\n\nI am testing it in my fork. see https://github.com/HyukjinKwon/spark/actions/runs/24021551482/job/70051292787\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55217 from HyukjinKwon/SPARK-56363.\n\nAuthored-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "7167dade35c33fae2f30a911e2b2a2440c9757c1",
      "tree": "4bb6b07f6d33cfdd0121dd13da9b7f1a27fe2809",
      "parents": [
        "734a86cada12bab42bdf81a3bddf709026348e77"
      ],
      "author": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Mon Apr 06 15:26:08 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Mon Apr 06 15:26:08 2026 +0900"
      },
      "message": "[SPARK-55330][INFRA] Add cmake to release Docker base image\n\n### What changes were proposed in this pull request?\n\nAdd cmake to the apt-get install list in `dev/create-release/spark-rm/Dockerfile.base`.\n\n### Why are the changes needed?\n\nSome R packages (e.g. `fs`) build native code with CMake; without cmake, the base image’s R `install.packages` step can fail when those builds run from source.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo, dev-only.\n\n### How was this patch tested?\n\nI am testing this in my fork.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55216 from HyukjinKwon/SPARK-55330.\n\nAuthored-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "734a86cada12bab42bdf81a3bddf709026348e77",
      "tree": "cd9b249a7a8e521b92f71edb9b3975e57850e76f",
      "parents": [
        "003855c99fef82931aa3378e61af77963f66ab04"
      ],
      "author": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Mon Apr 06 14:29:08 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Mon Apr 06 14:29:08 2026 +0900"
      },
      "message": "[SPARK-56360][INFRA] Wait for base and RM Docker logs in release workflow; avoid hang on early failure\n\n### What changes were proposed in this pull request?\n\nRelease workflow: wait for docker-build-base.log and docker-build.log before tail; stop waiting if the release process exits; include base log in redact/zip; kill tails safely.\n\n### Why are the changes needed?\n\nBase build only writes docker-build-base.log first; waiting only on docker-build.log hid progress and could hang if the script exited before that file existed.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo, dev-only\n\n### How was this patch tested?\n\nI will test it in my fork.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55213 from HyukjinKwon/dont-hang.\n\nAuthored-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "003855c99fef82931aa3378e61af77963f66ab04",
      "tree": "d987040050cd682faf313afba0e0c8d0ee418f4e",
      "parents": [
        "b16825455b2f1a62262a88da45417d2370d7b5aa"
      ],
      "author": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Mon Apr 06 10:47:20 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Mon Apr 06 10:47:20 2026 +0900"
      },
      "message": "[SPARK-55115][INFRA][FOLLOW-UP] Fix release workflow log tailing for base Docker image build\n\n### What changes were proposed in this pull request?\n\nTail and wait on `docker-build-base.log` before `docker-build.log`; kill that tail; add the base log to redact/zip.\n\n### Why are the changes needed?\n\nThe base image build logs only to `docker-build-base.log` first, so the old workflow showed no tail until `docker-build.log` existed.\n\n### Does this PR introduce _any_ user-facing change?\n\nDev-only.\n\n### How was this patch tested?\n\nNot run in CI from here; verify with a fork workflow run.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55211 from HyukjinKwon/show-logs-in-dryrun.\n\nAuthored-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "b16825455b2f1a62262a88da45417d2370d7b5aa",
      "tree": "16dc78c474b054d7a069dc45482a2702d2f0d4e2",
      "parents": [
        "6767941022e36d5498cc0dd51e58536a568451dd"
      ],
      "author": {
        "name": "yangjie01",
        "email": "yangjie01@baidu.com",
        "time": "Mon Apr 06 07:59:18 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Mon Apr 06 07:59:18 2026 +0900"
      },
      "message": "[SPARK-56358][BUILD] Add gson version override to SBT build to align with Maven\n\n### What changes were proposed in this pull request?\nThis pr add `com.google.code.gson:gson` to `DependencyOverrides` in `project/SparkBuild.scala`, reading the version from `gson.version` property in the effective pom.\n\n### Why are the changes needed?\nMaven\u0027s root `pom.xml` declares `gson.version\u003d2.13.2` in `\u003cdependencyManagement\u003e`, forcing all modules to resolve that version. The SBT build\u0027s `DependencyOverrides` object already overrides guava, jackson, avro, slf4j, xz, and jline to match Maven, but **gson was missing**.\n\nWithout this override, Coursier resolves gson `2.11.0` from transitive dependencies (`grpc-core:1.76.0` → `gson:2.11.0`, `protobuf-java-util:4.33.5` → `gson:2.11.0`), while Maven resolves `2.13.2`.\n\nThis causes differences in assembled jars for modules that shade gson (e.g., `connect-client-jvm`): gson 2.11.0 → 2.13.2 underwent internal class restructuring (e.g., `$Gson$Types` → `GsonTypes`, `TypeAdapters$EnumTypeAdapter` → standalone `EnumTypeAdapter`), resulting in ~20 class differences.\n\n### Does this PR introduce _any_ user-facing change?\nNo\n\n### How was this patch tested?\n- Pass Github Actions\n\n### Was this patch authored or co-authored using generative AI tooling?\nNo\n\nCloses #55204 from LuciferYang/SPARK-56358.\n\nAuthored-by: yangjie01 \u003cyangjie01@baidu.com\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "6767941022e36d5498cc0dd51e58536a568451dd",
      "tree": "9922ad3a9fa9d593455c7af2f5b8ecf624789fd9",
      "parents": [
        "08b1390cb2e62100b0b95663e8f1c8214e537905"
      ],
      "author": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Mon Apr 06 07:57:07 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Mon Apr 06 07:57:07 2026 +0900"
      },
      "message": "[SPARK-55667][PYTHON][CONNECT][FOLLOW-UP] Remove arguments in check_dependencies\n\n### What changes were proposed in this pull request?\n\nRemove the argument for `check_dependencies` in an end to end test\n\n### Why are the changes needed?\n\nIn https://github.com/apache/spark/pull/54463 we changed `check_dependencies` but this scala test was not modified properly. CI did not catch it because when `check_dependencies` is given the wrong number of arguments, it just raises an error - the same behavior as the dependency is not there. So the related tests believe we are missing dependencies and skip all the tests.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nLocally confirmed the test runs now.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55189 from gaogaotiantian/fix-dependency-check.\n\nAuthored-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "08b1390cb2e62100b0b95663e8f1c8214e537905",
      "tree": "a190fa70757e34533ad560ad9dc015fe257ea323",
      "parents": [
        "561b7b96e3157190d303d4039ce7fe3eda41b499"
      ],
      "author": {
        "name": "Yicong-Huang",
        "email": "17627829+Yicong-Huang@users.noreply.github.com",
        "time": "Mon Apr 06 07:56:22 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Mon Apr 06 07:56:22 2026 +0900"
      },
      "message": "[SPARK-56349][PYTHON] Remove unused ArrowStreamAggArrowUDFSerializer\n\n### What changes were proposed in this pull request?\n\nRemove `ArrowStreamAggArrowUDFSerializer` from `serializers.py`. This class is no longer used after SPARK-56123 refactored `SQL_GROUPED_AGG_ARROW_UDF` / `SQL_GROUPED_AGG_ARROW_ITER_UDF` and SPARK-56189 refactored `SQL_WINDOW_AGG_ARROW_UDF` to use `ArrowStreamSerializer` directly.\n\n### Why are the changes needed?\n\nDead code cleanup. Part of SPARK-55384 (Refactor PySpark Serializers).\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nExisting tests. No behavior change.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo\n\nCloses #55186 from Yicong-Huang/SPARK-56349/cleanup/agg-arrow-udf-serializer.\n\nAuthored-by: Yicong-Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "561b7b96e3157190d303d4039ce7fe3eda41b499",
      "tree": "bc0ca1246bb4438ed179436d1994b6bbee21024d",
      "parents": [
        "c6a198eab0f4c9789e8977db70a228ce6d57b66f"
      ],
      "author": {
        "name": "Yicong-Huang",
        "email": "17627829+Yicong-Huang@users.noreply.github.com",
        "time": "Mon Apr 06 07:55:52 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Mon Apr 06 07:55:52 2026 +0900"
      },
      "message": "[SPARK-56348][PYTHON] Remove unused ArrowBatchUDFSerializer\n\n### What changes were proposed in this pull request?\n\nRemove `ArrowBatchUDFSerializer` from `serializers.py`. This class is no longer used after SPARK-55902 refactored `SQL_ARROW_BATCHED_UDF` to use `ArrowStreamSerializer` directly.\n\n### Why are the changes needed?\n\nDead code cleanup. Part of SPARK-55384 (Refactor PySpark Serializers).\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nExisting tests. No behavior change.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo\n\nCloses #55185 from Yicong-Huang/SPARK-56348/cleanup/arrow-batch-udf-serializer.\n\nAuthored-by: Yicong-Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "c6a198eab0f4c9789e8977db70a228ce6d57b66f",
      "tree": "bc9b0cf0ca7fc2deaa44ce489fe8946850b9b617",
      "parents": [
        "c1dd15c25420ad66c2218f8ec1ee954d25f99da3"
      ],
      "author": {
        "name": "Fangchen Li",
        "email": "fangchen.li@outlook.com",
        "time": "Mon Apr 06 07:55:24 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Mon Apr 06 07:55:24 2026 +0900"
      },
      "message": "[MINOR][PYTHON] Fix PySparkException failing when messageParameters is omitted\n\n### What changes were proposed in this pull request?\n\nIn PySparkException, set messageParameters to an empty dict when it\u0027s not passed.\n\n### Why are the changes needed?\n\nDuring PySparkException construction, if messageParameters is not given, the original code will cast None to a dict. Discovered when working on #54143.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nUnittest.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Opus 4.6\n\nCloses #55181 from fangchenli/fix-malformed-error-messageParameters.\n\nAuthored-by: Fangchen Li \u003cfangchen.li@outlook.com\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "c1dd15c25420ad66c2218f8ec1ee954d25f99da3",
      "tree": "5992be5cc41b4c607d4219945c6d36ad5866157e",
      "parents": [
        "49e06b331f8e39376f8541a66e716cd1b0979251"
      ],
      "author": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Mon Apr 06 07:54:12 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Mon Apr 06 07:54:12 2026 +0900"
      },
      "message": "[SPARK-49793][PYTHON][TESTS][FOLLOW-UP] Fix test_caching in connect mode\n\n### What changes were proposed in this pull request?\n\nIf we are testing connect client, use the correct master and skip `test_caching`.\n\n### Why are the changes needed?\n\nCI is failing https://github.com/apache/spark/actions/runs/23917930072/job/69756535467 - connect tests can\u0027t have `local[1]`. This pattern is used in other tests.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nCI.\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55177 from gaogaotiantian/fix-test-caching.\n\nAuthored-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "49e06b331f8e39376f8541a66e716cd1b0979251",
      "tree": "3fb4d1121ea6563d7990078e25cda3dc494c078f",
      "parents": [
        "ed01691908d3d2ce357bfdedb724e4cbdca27a72"
      ],
      "author": {
        "name": "Xianming Lei",
        "email": "jerrylei@apache.org",
        "time": "Mon Apr 06 07:45:24 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Mon Apr 06 07:45:24 2026 +0900"
      },
      "message": "[SPARK-56317][SQL] GetJsonObjectEvaluator should reuse output buffer\n\n### What changes were proposed in this pull request?\nGetJsonObjectEvaluator should reuse output buffer.\n\n### Why are the changes needed?\nCurrently, GetJsonObjectEvaluator.evaluate() creates a new ByteArrayOutputStream on every invocation. Since the evaluator is reused across rows, this results in unnecessary object allocation and GC pressure. We can reuse the output buffer by promoting it to a class-level lazy val and calling reset() before each use, this is consistent with StructsToJsonEvaluator, which reuses a CharArrayWriter across rows.\n\n### Does this PR introduce _any_ user-facing change?\nNo.\n\n### How was this patch tested?\nExisting UTs.\n\n### Was this patch authored or co-authored using generative AI tooling?\nNo.\n\nCloses #55134 from leixm/SPARK-56317.\n\nAuthored-by: Xianming Lei \u003cjerrylei@apache.org\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "ed01691908d3d2ce357bfdedb724e4cbdca27a72",
      "tree": "50ecedc112c5f7b535e1be2c6c40d850491f9142",
      "parents": [
        "263c97643ab64bab139c29f91c6a2e32680587f0"
      ],
      "author": {
        "name": "donaldchai",
        "email": "donald.chai@gmail.com",
        "time": "Mon Apr 06 07:35:43 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Mon Apr 06 07:35:43 2026 +0900"
      },
      "message": "[MINOR][DOCS] Fix a typo of \"KLL\" initialism\n\n### What changes were proposed in this pull request?\n\nFix documentation to use the correct initialism for \"KLL\" (as used in https://datasketches.apache.org/docs/KLL/KLLSketch.html and elsewhere).\n\n### Why are the changes needed?\n\nAttributes this to the correct paper authors.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nN/A\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55072 from donaldchai/donaldchai-patch-1.\n\nAuthored-by: donaldchai \u003cdonald.chai@gmail.com\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "263c97643ab64bab139c29f91c6a2e32680587f0",
      "tree": "89a58e0175b4a0ecdd51b6b73b1c852453babc00",
      "parents": [
        "1ea48e4b68a17a7d0170b4ecca96ca427f217059"
      ],
      "author": {
        "name": "yangjie01",
        "email": "yangjie01@baidu.com",
        "time": "Sun Apr 05 16:06:11 2026 +0800"
      },
      "committer": {
        "name": "yangjie01",
        "email": "yangjie01@baidu.com",
        "time": "Sun Apr 05 16:06:11 2026 +0800"
      },
      "message": "[SPARK-56209][BUILD][FOLLOWUP] Exclude Netty transitive dependencies from Vert.x in Kubernetes modules\n\n### What changes were proposed in this pull request?\nThis PR adds Netty exclusions to the `vertx-core` and `vertx-web-client` dependencies in both `resource-managers/kubernetes/core/pom.xml` and `resource-managers/kubernetes/integration-tests/pom.xml`.\n\n```xml\n\u003cexclusions\u003e\n  \u003cexclusion\u003e\n    \u003cgroupId\u003eio.netty\u003c/groupId\u003e\n    \u003cartifactId\u003e*\u003c/artifactId\u003e\n  \u003c/exclusion\u003e\n\u003c/exclusions\u003e\n```\n\n### Why are the changes needed?\nFix maven daily tests:\n\n- https://github.com/apache/spark/actions/runs/23947759281/job/69847651473\n\n```\n*** RUN ABORTED ***\nA needed class was not found. This could be due to an error in your runpath. Missing class: io/netty/channel/IoHandler\n  java.lang.NoClassDefFoundError: io/netty/channel/IoHandler\n  at java.base/java.lang.ClassLoader.defineClass1(Native Method)\n  at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)\n  at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)\n  at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)\n  at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)\n  at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)\n  at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)\n  at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)\n  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)\n  at org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:78)\n  ...\n  Cause: java.lang.ClassNotFoundException: io.netty.channel.IoHandler\n  at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)\n  
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)\n  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)\n  at java.base/java.lang.ClassLoader.defineClass1(Native Method)\n  at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)\n  at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)\n  at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)\n  at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)\n  at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)\n  at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)\n  ...\n[WARNING] The requested profile \"hive\" could not be activated because it does not exist.\n[ERROR] Failed to execute goal org.scalatest:scalatest-maven-plugin:2.2.0:test (test) on project spark-kubernetes_2.13: There are test failures -\u003e [Help 1]\n[ERROR]\n[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.\n[ERROR] Re-run Maven using the -X switch to enable full debug logging.\n[ERROR]\n[ERROR] For more information about the errors and possible solutions, please read the following articles:\n[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException\n[ERROR]\n[ERROR] After correcting the problems, you can resume the build with the command\n[ERROR]   mvn \u003cargs\u003e -rf :spark-kubernetes_2.13\n\nError: Process completed with exit code 1.\n```\n\n### Does this PR introduce _any_ user-facing change?\nNo\n\n### How was this patch tested?\ntest with maven: https://github.com/LuciferYang/spark/runs/69873815559\n\n![image](https://github.com/user-attachments/assets/fdbbdbf8-553d-4716-9602-e1ee4ae8bc63)\n\n### Was this patch authored or co-authored using generative AI tooling?\nNo\n\nCloses #55188 from 
LuciferYang/fix-vertx-netty-exclusion.\n\nAuthored-by: yangjie01 \u003cyangjie01@baidu.com\u003e\nSigned-off-by: yangjie01 \u003cyangjie01@baidu.com\u003e\n"
    },
    {
      "commit": "1ea48e4b68a17a7d0170b4ecca96ca427f217059",
      "tree": "f45b327716c06be894801771c9ddef648df5c725",
      "parents": [
        "a20bfc7cafca3c309c5e61282ec7f64b720f2411"
      ],
      "author": {
        "name": "Herman van Hövell",
        "email": "herman@databricks.com",
        "time": "Sun Apr 05 14:31:22 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Sun Apr 05 14:31:22 2026 +0900"
      },
      "message": "[SPARK-56007][CONNECT] Fix ArrowDeserializer to use positional binding for rows\n\n### What changes were proposed in this pull request?\nThis PR switches RowEncoder deserialization in the Spark Connect Scala client from name-based lookup to positional binding to correctly handle duplicate column names.\n\n### Why are the changes needed?\nThe Spark Connect Scala client can\u0027t handle with rows with duplicate column names. This is regression w.r.t. classic.\n\n### Does this PR introduce _any_ user-facing change?\nYes. It fixes a bug.\n\n### How was this patch tested?\nI added tests to ArrowEncoderSuite.\n\n### Was this patch authored or co-authored using generative AI tooling?\nGenerated-by: Claude Code v2.1.76\n\nCloses #54832 from hvanhovell/SPARK-56007.\n\nAuthored-by: Herman van Hövell \u003cherman@databricks.com\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "a20bfc7cafca3c309c5e61282ec7f64b720f2411",
      "tree": "5a0f902df67b67245bacaf3271a5af42b69f6931",
      "parents": [
        "842eb7b7c2dbf7b20e8c59d0c351e79f760dcc74"
      ],
      "author": {
        "name": "Kousuke Saruta",
        "email": "sarutak@amazon.co.jp",
        "time": "Sun Apr 05 12:53:39 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Sun Apr 05 12:53:39 2026 +0900"
      },
      "message": "[SPARK-56356][BUILD] Fix an issue in release build caused by error on fetching artifacts\n\n### What changes were proposed in this pull request?\nThis PR fixes a release build failure which recently happens at document generation phase due to error on fetching artifacts.\n\nhttps://github.com/apache/spark/actions/runs/23083193766/job/67055792458\n\n```\nError:  lmcoursier.internal.shaded.coursier.error.FetchError$DownloadingArtifacts: Error fetching artifacts:\nError:  file:/home/spark-rm/.m2/repository/io/netty/netty-codec-protobuf/4.2.10.Final/netty-codec-protobuf-4.2.10.Final.jar: not found: /home/spark-rm/.m2/repository/io/netty/netty-codec-protobuf/4.2.10.Final/netty-codec-protobuf-4.2.10.Final.jar\nError:  file:/home/spark-rm/.m2/repository/io/netty/netty-codec-marshalling/4.2.10.Final/netty-codec-marshalling-4.2.10.Final.jar: not found: /home/spark-rm/.m2/repository/io/netty/netty-codec-marshalling/4.2.10.Final/netty-codec-marshalling-4.2.10.Final.jar\nError:\nError:  \tat lmcoursier.internal.shaded.coursier.Artifacts$.$anonfun$fetchArtifacts$9(Artifacts.scala:365)\nError:  \tat lmcoursier.internal.shaded.coursier.util.Task$.$anonfun$flatMap$extension$1(Task.scala:14)\nError:  \tat lmcoursier.internal.shaded.coursier.util.Task$.$anonfun$flatMap$extension$1$adapted(Task.scala:14)\nError:  \tat lmcoursier.internal.shaded.coursier.util.Task$.wrap(Task.scala:82)\nError:  \tat lmcoursier.internal.shaded.coursier.util.Task$.$anonfun$flatMap$2(Task.scala:14)\nError:  \tat scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)\nError:  \tat scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:51)\nError:  \tat scala.concurrent.impl.CallbackRunnable.run(Promise.scala:74)\nError:  \tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\nError:  \tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\nError:  \tat 
java.base/java.lang.Thread.run(Thread.java:840)\nError:  Caused by: lmcoursier.internal.shaded.coursier.cache.ArtifactError$NotFound: not found: /home/spark-rm/.m2/repository/io/netty/netty-codec-protobuf/4.2.10.Final/netty-codec-protobuf-4.2.10.Final.jar\nError:  \tat lmcoursier.internal.shaded.coursier.cache.internal.Downloader.$anonfun$checkFileExists$1(Downloader.scala:603)\nError:  \tat scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)\nError:  \tat scala.util.Success.$anonfun$map$1(Try.scala:255)\nError:  \tat scala.util.Success.map(Try.scala:213)\nError:  \tat scala.concurrent.Future.$anonfun$map$1(Future.scala:292)\nError:  \tat scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:42)\nError:  \tat scala.concurrent.impl.CallbackRunnable.run(Promise.scala:74)\nError:  \tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\nError:  \tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\nError:  \tat java.base/java.lang.Thread.run(Thread.java:840)\nError:  (streaming-kinesis-asl / update) lmcoursier.internal.shaded.coursier.error.FetchError$DownloadingArtifacts: Error fetching artifacts:\nError:  file:/home/spark-rm/.m2/repository/io/netty/netty-codec-protobuf/4.2.10.Final/netty-codec-protobuf-4.2.10.Final.jar: not found: /home/spark-rm/.m2/repository/io/netty/netty-codec-protobuf/4.2.10.Final/netty-codec-protobuf-4.2.10.Final.jar\nError:  file:/home/spark-rm/.m2/repository/io/netty/netty-codec-marshalling/4.2.10.Final/netty-codec-marshalling-4.2.10.Final.jar: not found: /home/spark-rm/.m2/repository/io/netty/netty-codec-marshalling/4.2.10.Final/netty-codec-marshalling-4.2.10.Final.jar\nError:  Total time: 368 s (0:06:08.0), completed Mar 14, 2026, 8:40:17 AM\n                    ------------------------------------------------\n      Jekyll 4.4.1   Please append `--trace` to the `build` command\n                     for any additional information or 
backtrace.\n                    ------------------------------------------------\n```\n\nThis issue is similar to SPARK-34762 and SPARK-37302 in that there are pom files but no corresponding jar files under `.m2` for some dependencies.\nIn this case, the following command is executed through [make-distribution.sh](https://github.com/apache/spark/blob/d9c8eda57e22f65d0443cab7078c632462c11272/dev/make-distribution.sh#L183) and downloads pom files for `xz:1.10`, `netty-codec-protobuf` and `netty-codec-marshalling`.\n\n```\nbuild/mvn clean package -DskipTests -Dmaven.javadoc.skip\u003dtrue -Dmaven.scaladoc.skip\u003dtrue -Dmaven.source.skip -Dcyclonedx.skip\u003dtrue -B -Pyarn -Pkubernetes -Phadoop-3 -Phive -Phive-thriftserver\n```\n\nAnd when building documents, the following command is executed through [build_api_docs.rb](https://github.com/apache/spark/blob/d9c8eda57e22f65d0443cab7078c632462c11272/docs/_plugins/build_api_docs.rb#L48) and tries to download the dependencies.\n\n```\nNO_PROVIDED_SPARK_JARS\u003d0 build/sbt -Phive -Pkinesis-asl clean package\n```\n\nRegarding xz, `1.12` is declared in `pom.xml`, so this PR fixes `SparkBuild.scala` to pin the version.\nRegarding `netty-codec-protobuf` and `netty-codec-marshalling`, they are declared in pom.xml to be excluded. 
So, this PR fixes `SparkBuild.scala` to exclude them.\n\n### Why are the changes needed?\nTo recover the release build.\n\n### Does this PR introduce _any_ user-facing change?\nNo.\n\n### How was this patch tested?\nThe following command successfully finishes on my laptop.\n```\n$ build/mvn clean package -DskipTests -Dmaven.javadoc.skip\u003dtrue -Dmaven.scaladoc.skip\u003dtrue -Dmaven.source.skip -Dcyclonedx.skip\u003dtrue -B -Pyarn -Pkubernetes -Phadoop-3 -Phive -Phive-thriftserver\n$ SKIP_SCALADOC\u003d1 SKIP_RDOC\u003d1 SKIP_SQLDOC\u003d1 bundle exec jekyll build\n```\n\nNote that to build documents, please follow the instructions in `docs/README.md`\n\n### Was this patch authored or co-authored using generative AI tooling?\nNo.\n\nCloses #55198 from sarutak/fix-dependency-resolution-issue.\n\nAuthored-by: Kousuke Saruta \u003csarutak@amazon.co.jp\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "842eb7b7c2dbf7b20e8c59d0c351e79f760dcc74",
      "tree": "57bc3e1c6165409626572fb8303c02deed2be576",
      "parents": [
        "d9c8eda57e22f65d0443cab7078c632462c11272"
      ],
      "author": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Sun Apr 05 10:32:44 2026 +0900"
      },
      "committer": {
        "name": "Hyukjin Kwon",
        "email": "gurwls223@apache.org",
        "time": "Sun Apr 05 10:32:44 2026 +0900"
      },
      "message": "[SPARK-54938][PYTHON][TEST][FOLLOW-UP] Fix inferred time unit for pandas \u003e\u003d 3\n\n### What changes were proposed in this pull request?\nFix inferred time unit for pandas \u003e\u003d 3\n\n### Why are the changes needed?\nthere is behavior change in pandas 3\n\n### Does this PR introduce _any_ user-facing change?\nNo, test-only\n\n### How was this patch tested?\nmanually check\n\npandas\u003d2.3.3\n```\nIn [7]: pd.__version__\nOut[7]: \u00272.3.3\u0027\n\nIn [8]: pd.Series(pd.to_datetime([\"2024-01-01\", \"2024-01-02\"])).dtype\nOut[8]: dtype(\u0027\u003cM8[ns]\u0027)\n\nIn [9]: pa.array(pd.Series(pd.to_datetime([\"2024-01-01\", \"2024-01-02\"]))).type\nOut[9]: TimestampType(timestamp[ns])\n```\n\npandas\u003d3.0.1\n```\nIn [6]: pd.__version__\nOut[6]: \u00273.0.1\u0027\n\nIn [7]: pd.Series(pd.to_datetime([\"2024-01-01\", \"2024-01-02\"])).dtype\nOut[7]: dtype(\u0027\u003cM8[us]\u0027)\n\nIn [8]: pa.array(pd.Series(pd.to_datetime([\"2024-01-01\", \"2024-01-02\"]))).type\nOut[8]: TimestampType(timestamp[us])\n```\n\n### Was this patch authored or co-authored using generative AI tooling?\nCo-authored-by: Claude code (Opus 4.6)\n\nCloses #55158 from zhengruifeng/fix-pyarrow-ts-inference.\n\nAuthored-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\nSigned-off-by: Hyukjin Kwon \u003cgurwls223@apache.org\u003e\n"
    },
    {
      "commit": "d9c8eda57e22f65d0443cab7078c632462c11272",
      "tree": "86e2a3fd0e9d157b2aa9fb4c68d7bac1d231479a",
      "parents": [
        "504060ffdfee26608a49b9241549f6062f69971d"
      ],
      "author": {
        "name": "Marcin Wojtyczka",
        "email": "marcin.wojtyczka@databricks.com",
        "time": "Fri Apr 03 11:08:52 2026 -0700"
      },
      "committer": {
        "name": "Takuya Ueshin",
        "email": "ueshin@databricks.com",
        "time": "Fri Apr 03 11:08:52 2026 -0700"
      },
      "message": "[SPARK-56322][CONNECT][PYTHON] Fix TypeError when self-joining observed DataFrames\n\n### What changes were proposed in this pull request?\n\nFixing bug: https://issues.apache.org/jira/browse/SPARK-56322\n\nReplace `dict(**a, **b)` with `{**a, **b}` dict literal syntax when merging observations across plan branches in `Join`, `AsOfJoin`, `LateralJoin`, `SetOperation`, and `CollectMetrics`.\n\n### Why are the changes needed?\n\nWhen a DataFrame with `.observe()` is filtered into two subsets and then self-joined, both branches of the join carry the same `Observation` instance under the same name. The `observations` property merges left and right observations using `dict(**left, **right)`, which raises `TypeError` when both dicts contain the same key:\n\n```\nTypeError: dict() got multiple values for keyword argument \u0027my_observation\u0027\n```\n\nThis is a Python semantics issue — `dict(**a, **b)` treats each key as a keyword argument, and Python does not allow duplicate keyword arguments. The dict literal `{**a, **b}` does not have this restriction and silently lets the last value win, which is correct here since both values are the same `Observation` instance originating from the same `.observe()` call.\n\n**Why \"last value wins\" is safe here:** When a DataFrame is observed and then branched (filtered, aliased), both branches inherit a reference to the *same* `Observation` instance. The duplicate keys in the merge always map to the identical Python object — there is no scenario where two different `Observation` instances share the same name within a single plan tree. 
Therefore, deduplication does not lose any data.\n\nThis pattern affects any workflow that:\n- Observes a DataFrame (e.g., for monitoring row counts or data quality metrics)\n- Filters or transforms it into multiple subsets\n- Joins the subsets back together\n\nThis is common in data quality pipelines (split into valid/invalid rows, then rejoin) and ETL workflows that branch and merge.\n\n### How to reproduce\n\n```python\nfrom pyspark.sql import Observation\nfrom pyspark.sql.functions import count, lit\n\nobs \u003d Observation(\"my_observation\")\ndf \u003d (\n    spark.range(100)\n    .selectExpr(\"id\", \"case when id \u003c 10 then \u0027A\u0027 else \u0027B\u0027 end as group_key\")\n    .observe(obs, count(lit(1)).alias(\"row_count\"))\n)\n\n# Filter into two subsets — both carry the same observation\ndf1 \u003d df.where(\"id \u003c 20\")\ndf2 \u003d df.where(\"id % 2 \u003d\u003d 0\")\n\n# Self-join triggers the bug\njoined \u003d df1.alias(\"a\").join(df2.alias(\"b\"), on\u003d[\"id\"], how\u003d\"inner\")\njoined.collect()\n# TypeError: dict() got multiple values for keyword argument \u0027my_observation\u0027\n```\n\n### Does this PR introduce _any_ user-facing change?\n\nYes — self-joining an observed DataFrame no longer raises `TypeError`. 
No behavior change for joins where observations don\u0027t overlap (the common case).\n\n### How was this patch tested?\n\nAdded unit tests in `test_connect_plan.py` covering:\n- `Join` with duplicate observation names (the reported scenario)\n- `Join` with distinct observations (regression check)\n- `SetOperation` with duplicate observation names\n- `CollectMetrics` with parent sharing the same observation name\n\nAll tests fail on the unpatched code with the exact `TypeError` and pass with the fix.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nYes, Claude Code for testing\n\nCloses #55140 from mwojtyczka/fix-observation-self-join.\n\nLead-authored-by: Marcin Wojtyczka \u003cmarcin.wojtyczka@databricks.com\u003e\nCo-authored-by: Greg Hansen \u003cgregory.hansen@databricks.com\u003e\nSigned-off-by: Takuya Ueshin \u003cueshin@databricks.com\u003e\n"
    },
    {
      "commit": "504060ffdfee26608a49b9241549f6062f69971d",
      "tree": "95455c1d968e4a3f424ac76d813011a51c50483d",
      "parents": [
        "5f5fc89f2a9cad474eaac6b3ade4da342785abf4"
      ],
      "author": {
        "name": "Vladan Vasić",
        "email": "vladan.vasic@databricks.com",
        "time": "Fri Apr 03 23:40:36 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Fri Apr 03 23:40:36 2026 +0800"
      },
      "message": "[SPARK-56347][TEST] Fix TOCTOU race in DockerJDBCIntegrationSuite port allocation\n\n### What changes were proposed in this pull request?\n\nDocker integration tests fail in CI when running multiple \u003e20 instances of same test on same machine. The suite aborts in `beforeAll` with `address already in use` before any test executes.\n\nRoot cause is a TOCTOU race: `ServerSocket(0)` finds a free port, closes the socket, then passes the port to Docker later. Between close and Docker bind, another parallel test grabs the same port.\n\n- Use Docker port 0 (auto-assign) instead of pre-allocating via `ServerSocket(0)`, eliminating the TOCTOU entirely\n- Resolve the actual Docker-assigned port after container startup via `resolvedMappedPort` / `inspectContainer`\n- Change `port` declarations from `val` to `def` in test base traits and subclasses so they read the resolved port at access time\n\n### Why are the changes needed?\n\nAvoiding flakiness in tests.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nThe change is test itself.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nYes.\n\nCloses #55182 from vladanvasi-db/docker-jdbc-stability.\n\nAuthored-by: Vladan Vasić \u003cvladan.vasic@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "5f5fc89f2a9cad474eaac6b3ade4da342785abf4",
      "tree": "5d39a16383c49ef1d582b8ecde4237bfaeee38a5",
      "parents": [
        "cba57054200b9992e61800243b55612bec5944e7"
      ],
      "author": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Fri Apr 03 18:19:05 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Fri Apr 03 18:19:05 2026 +0800"
      },
      "message": "[SPARK-56329][PYTHON] Fix all E721 type comparison violations\n\n### What changes were proposed in this pull request?\nFix all E721 (`type-comparison`) violations across the PySpark codebase and remove E721 from the `pyproject.toml` ruff ignore list.\n\n### Why are the changes needed?\nE721 was ignored in `pyproject.toml` with a TODO comment (\"too many for now\"). This PR fixes all 78 violations across 24 files so the rule can be enforced going forward.\n\nChanges:\n- `type(x) \u003d\u003d SomeType` → `isinstance(x, SomeType)`\n- `type(x) !\u003d SomeType` → `not isinstance(x, SomeType)`\n- `type(x) \u003d\u003d type(y)` / `type(x) !\u003d type(y)` → `type(x) is type(y)` / `type(x) is not type(y)`\n- `type(x) in [A, B]` → `isinstance(x, (A, B))`\n- `variant_type \u003d\u003d dict` → `variant_type is dict` (type-identity comparisons)\n\n### Does this PR introduce _any_ user-facing change?\nNo\n\n### How was this patch tested?\nCI\n\n### Was this patch authored or co-authored using generative AI tooling?\nCo-authored-by: Claude code (Opus 4.6)\n\nCloses #55150 from zhengruifeng/fix-isinstance-pandas-types.\n\nAuthored-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\n"
    },
    {
      "commit": "cba57054200b9992e61800243b55612bec5944e7",
      "tree": "eaccd42545f1a5891c4b3bfdf23ab1657bd54fc0",
      "parents": [
        "50f717943a39631ac16ceaefbf0f097caaddde13"
      ],
      "author": {
        "name": "Yicong-Huang",
        "email": "17627829+Yicong-Huang@users.noreply.github.com",
        "time": "Fri Apr 03 15:11:15 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Fri Apr 03 15:11:15 2026 +0800"
      },
      "message": "[SPARK-56341][PYTHON][DOCS] Fix outdated PyArrow minimum version in arrow_pandas.rst\n\n### What changes were proposed in this pull request?\n\n- Updated the minimum PyArrow version from 11.0.0 to 18.0.0 in `python/docs/source/tutorial/sql/arrow_pandas.rst`.\n- Added `python/docs/source/tutorial/sql/arrow_pandas.rst` to the version-change reminder comments in `python/packaging/classic/setup.py`, `python/packaging/client/setup.py`, and `python/packaging/connect/setup.py` so this file is not missed in future version bumps.\n\n### Why are the changes needed?\n\nThe documentation in `arrow_pandas.rst` still states the minimum PyArrow version is 11.0.0, but it was raised to 18.0.0 in SPARK-51334. The setup.py reminder comments also do not reference this doc file, which is why it was missed.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nDocumentation-only change; no tests needed.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55172 from Yicong-Huang/SPARK-56341.\n\nAuthored-by: Yicong-Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\n"
    },
    {
      "commit": "50f717943a39631ac16ceaefbf0f097caaddde13",
      "tree": "742a7ca88da8df7386d20c70b7e435c7773848bc",
      "parents": [
        "e1305252f12211d85674ea10d473e873d13b1a9c"
      ],
      "author": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Fri Apr 03 15:07:48 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Fri Apr 03 15:07:48 2026 +0800"
      },
      "message": "[SPARK-56313][PYTHON][FOLLOWUP] Remove rddsampler from mypy exception list\n\n### What changes were proposed in this pull request?\n\nRemove rddsampler out of `mypy.ini`.\n\n### Why are the changes needed?\n\nThis should be part of https://github.com/apache/spark/pull/55122 but it was forgotten.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nLocal mypy test passed.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55171 from gaogaotiantian/remove-rdd-sampler-ignore.\n\nAuthored-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\n"
    },
    {
      "commit": "e1305252f12211d85674ea10d473e873d13b1a9c",
      "tree": "64267595de33456043f9288114df9d0236b2b0a0",
      "parents": [
        "7cb54df2a42520311989f1eb3f7b0898b7119a3b"
      ],
      "author": {
        "name": "Yicong-Huang",
        "email": "17627829+Yicong-Huang@users.noreply.github.com",
        "time": "Fri Apr 03 15:05:28 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Fri Apr 03 15:05:28 2026 +0800"
      },
      "message": "[SPARK-55902][PYTHON] Refactor SQL_ARROW_BATCHED_UDF\n\n### What changes were proposed in this pull request?\n\nRefactor `SQL_ARROW_BATCHED_UDF` (non-legacy path) to use `ArrowStreamSerializer` as pure I/O, moving Arrow-to-Python and Python-to-Arrow conversion logic from `ArrowBatchUDFSerializer` into `read_udfs()` in `worker.py`.\n\n### Why are the changes needed?\n\nPart of SPARK-55388.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\n`COLUMNS\u003d120 asv run --python\u003dsame --bench \"ArrowBatched\" --attribute \"repeat\u003d(3,5,5.0)\"` before and after:\n\n**ArrowBatchedUDFTimeBench** - Before (master):\n```\n\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\n--                                       udf\n------------------- ----------------------------------------------\n      scenario       identity_udf   stringify_udf   nullcheck_udf\n\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\n  sm_batch_few_col    62.1+-0.2ms      66.1+-0.8ms      61.2+-0.1ms\n sm_batch_many_col    154+-0.4ms       155+-0.4ms       154+-0.3ms\n  lg_batch_few_col    148+-0.3ms       157+-0.4ms       147+-0.5ms\n lg_batch_many_col     623+-2ms         624+-2ms         620+-3ms\n     pure_ints        220+-0.5ms       231+-0.7ms        220+-6ms\n    pure_floats       224+-0.8ms        262+-1ms        225+-0.7ms\n    pure_strings       414+-1ms        415+-0.6ms        404+-1ms\n    mixed_types        311+-1ms        318+-0.8ms       308+-0.7ms\n\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\n```\n\n**ArrowBatchedUDFTimeBench** - After (this PR):\n```\n\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\n--                                       udf\n------------------- ----------------------------------------------\n      scenario       identity_udf   stringify_udf   nullcheck_udf\n\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\n  sm_batch_few_col    59.9+-0.4ms     63.5+-0.06ms     59.5+-0.3ms\n sm_batch_many_col    149+-0.1ms       150+-0.3ms       149+-0.2ms\n  lg_batch_few_col    144+-0.4ms       153+-0.8ms       143+-0.5ms\n lg_batch_many_col     602+-2ms         603+-2ms         598+-3ms\n     pure_ints         216+-1ms       226+-0.7ms       214+-0.5ms\n    pure_floats        215+-1ms       251+-0.9ms        215+-1ms\n    pure_strings      391+-0.8ms        395+-2ms       385+-0.9ms\n    mixed_types       299+-0.7ms        307+-1ms         297+-1ms\n\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\n```\n\n**ArrowBatchedUDFPeakmemBench** - Before (master):\n```\n\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\n--                                       udf\n------------------- ----------------------------------------------\n      scenario       identity_udf   stringify_udf   nullcheck_udf\n\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\n  sm_batch_few_col       119M            119M            118M\n sm_batch_many_col       123M            123M            123M\n  lg_batch_few_col       124M            124M            122M\n lg_batch_many_col       159M            160M            159M\n     pure_ints           122M            123M            122M\n    pure_floats          124M            125M            123M\n    pure_strings         125M            125M            124M\n    mixed_types          123M            124M             123M\n\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\n```\n\n**ArrowBatchedUDFPeakmemBench** - After (this PR):\n```\n\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\n--                                       udf\n------------------- ----------------------------------------------\n      scenario       identity_udf   stringify_udf   nullcheck_udf\n\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\n  sm_batch_few_col       119M            119M            119M\n sm_batch_many_col       123M            123M            123M\n  lg_batch_few_col       123M            124M            122M\n lg_batch_many_col       160M            161M            160M\n     pure_ints           122M            123M            122M\n    pure_floats          124M            125M            123M\n    pure_strings         125M            125M            125M\n    mixed_types          124M            124M            124M\n\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d \u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\n```\n\n**Summary**: Latency improved 3-5% across all scenarios (e.g., `identity_udf` on `pure_strings` dropped from 414ms to 391ms). This is likely due to eliminating the serializer class overhead and reducing indirection layers. Peak memory is unchanged, as expected since the refactor only reorganizes logic without changing data layout or buffering strategy.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo\n\nCloses #54705 from Yicong-Huang/SPARK-55902/refactor/arrow-batch-udf.\n\nLead-authored-by: Yicong-Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\nCo-authored-by: Yicong Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\n"
    },
    {
      "commit": "7cb54df2a42520311989f1eb3f7b0898b7119a3b",
      "tree": "0b2b3949bf6ec40ca7f97a91e2215f7e2ec67e51",
      "parents": [
        "e2f15f6ee3c0dd7985ba9fadbf3d1abfbeed3e49"
      ],
      "author": {
        "name": "Richard Chen",
        "email": "r.chen@databricks.com",
        "time": "Fri Apr 03 12:50:07 2026 +0900"
      },
      "committer": {
        "name": "Jungtaek Lim",
        "email": "kabhwan.opensource@gmail.com",
        "time": "Fri Apr 03 12:50:07 2026 +0900"
      },
      "message": "[SPARK-56280][SS] normalize NaN and +/-0.0 in streaming dedupe node\n\n### What changes were proposed in this pull request?\n\nAs the title says: adds a project operator before the streaming dedupe node when any of the keys are doubles/floats.\n\n### Why are the changes needed?\n\nIf two NaNs have different bit patterns, but are both NaN, they won\u0027t be deduplicated. They should be deduplicated.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes - NaNs with different bit patterns are now deduplicated/considered duplicates.\n\n### How was this patch tested?\n\nAdded a UT.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: claude 2.1.88 (Claude Code)\n\nCloses #55088 from richardc-db/richardc-db/streaming_dedupe_nan_and_0s.\n\nAuthored-by: Richard Chen \u003cr.chen@databricks.com\u003e\nSigned-off-by: Jungtaek Lim \u003ckabhwan.opensource@gmail.com\u003e\n"
    },
    {
      "commit": "e2f15f6ee3c0dd7985ba9fadbf3d1abfbeed3e49",
      "tree": "8008bc0b4551372d77c318a6c20b97b7913a0f97",
      "parents": [
        "ccd4206c7e51144ee0e59454e1ceda1d349f6c9d"
      ],
      "author": {
        "name": "Kavpreet Grewal",
        "email": "kavpreet.grewal@databricks.com",
        "time": "Fri Apr 03 12:19:41 2026 +0900"
      },
      "committer": {
        "name": "Jungtaek Lim",
        "email": "kabhwan.opensource@gmail.com",
        "time": "Fri Apr 03 12:19:41 2026 +0900"
      },
      "message": "[SPARK-56243][SS] Throw detailed error on malformed Kafka record timestamps\n\n### What changes were proposed in this pull request?\n\nThrow a detailed `KAFKA_MALFORMED_RECORD_TIMESTAMP` message when a record has a malformed or incorrect-precision timestamp.\n\n### Why are the changes needed?\n\nSome users may use custom producers which set the timestamp to a different precision level than what Kafka expects. Upon hitting this issue, they would get a generic arithmetic overflow error.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes, it introduces a new error message; however, there is no other behavioral change.\n\n### How was this patch tested?\n\nN/A\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55038 from kavpreetgrewal/SPARK-56243-timestamp-parsing.\n\nAuthored-by: Kavpreet Grewal \u003ckavpreet.grewal@databricks.com\u003e\nSigned-off-by: Jungtaek Lim \u003ckabhwan.opensource@gmail.com\u003e\n"
    },
    {
      "commit": "ccd4206c7e51144ee0e59454e1ceda1d349f6c9d",
      "tree": "6e7f047d1adc59f35606bca76e553d5e137f08d1",
      "parents": [
        "650b0a65a4e3c0d5b4ac6d7f604b423f012012ac"
      ],
      "author": {
        "name": "Takuya Ueshin",
        "email": "ueshin@databricks.com",
        "time": "Thu Apr 02 17:33:05 2026 -0700"
      },
      "committer": {
        "name": "Takuya Ueshin",
        "email": "ueshin@databricks.com",
        "time": "Thu Apr 02 17:33:05 2026 -0700"
      },
      "message": "[SPARK-56345][PYTHON][TESTS] Use `pd.Series.__name__` in Arrow UDF type-hint test\n\n### What changes were proposed in this pull request?\n\nThis PR updates `python/pyspark/sql/tests/arrow/test_arrow_udf_typehints.py` to make the negative Arrow UDF type-hint assertion build the expected error pattern from `pd.Series.__name__` instead of hard-coding `pandas.core.series.Series`.\n\nThe change is limited to the test expectation in `ArrowUDFTypeHintsTests.test_negative_with_pandas_udf`.\n\n### Why are the changes needed?\n\nThe previous assertion depended on a hard-coded fully qualified pandas type name. Building the expected pattern from `pd.Series.__name__` makes the test less coupled to a specific pandas type string representation while still checking that an unsupported pandas `Series` signature is rejected.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nUpdated the related test expectation.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55175 from ueshin/issues/SPARK-56345/typehints.\n\nAuthored-by: Takuya Ueshin \u003cueshin@databricks.com\u003e\nSigned-off-by: Takuya Ueshin \u003cueshin@databricks.com\u003e\n"
    },
    {
      "commit": "650b0a65a4e3c0d5b4ac6d7f604b423f012012ac",
      "tree": "45897747d9cd98fd4a16ea85e9388bd3c2dfeb8f",
      "parents": [
        "e9a348e30e38b126fa1b5359cabf93bc77760821"
      ],
      "author": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Fri Apr 03 08:30:27 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Fri Apr 03 08:30:27 2026 +0800"
      },
      "message": "[SPARK-56338][INFRA] Support Maven mirrors for build\n\n### What changes were proposed in this pull request?\n\nSupport Maven mirrors when building Spark. Users can define an environment variable `MAVEN_MIRROR_URL` to use a mirror site for Maven packages.\n\n### Why are the changes needed?\n\nIt provides flexibility for users to direct the build system to a mirror/proxy site when the default ones are not available for networking reasons (unstable connection, access restriction etc).\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nLocally tested with a mirror site.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55168 from gaogaotiantian/support-maven-mirror.\n\nAuthored-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\n"
    },
    {
      "commit": "e9a348e30e38b126fa1b5359cabf93bc77760821",
      "tree": "ace5f56e0f4aa8b798f72df48a0cc42d5fbfa56d",
      "parents": [
        "922de744ea23467a9ec287fa4e5e349ea7437e07"
      ],
      "author": {
        "name": "Takuya Ueshin",
        "email": "ueshin@databricks.com",
        "time": "Thu Apr 02 13:03:48 2026 -0700"
      },
      "committer": {
        "name": "Takuya Ueshin",
        "email": "ueshin@databricks.com",
        "time": "Thu Apr 02 13:03:48 2026 -0700"
      },
      "message": "[SPARK-56327][PYTHON][TESTS] Fix grouped map pandas tests for pandas 3\n\n### What changes were proposed in this pull request?\n\nThis PR updates `python/pyspark/sql/tests/pandas/test_pandas_grouped_map.py` for pandas 3 behavior in grouped map pandas UDF tests.\n\nThe changes are:\n- update the expected boolean inversion in `test_supported_types` to use `~pdf.bool` on pandas 3 while keeping the existing behavior on older pandas versions\n- update several pandas-side expected-value paths to avoid grouping directly by the same in-DataFrame column on pandas 3\n- use copied groupers such as `pdf.id.copy()` so the grouped pandas input still keeps the grouping columns while preserving the original grouping semantics\n- add comments explaining why the pandas 3 branch uses copied groupers\n\n### Why are the changes needed?\n\nIn pandas 3, `GroupBy.apply` drops grouping columns when the grouping key is the same DataFrame column. These tests build expected results by applying the Python function to pandas-grouped data, so the previous expectations no longer match the grouped map input shape seen by Spark in cases that rely on grouping columns remaining present.\n\nThe boolean expectation also needs a pandas-3-specific branch because the old scalar-style inversion logic does not match the pandas 3 object being operated on in these tests.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nUpdated the related tests.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Codex (GPT-5)\n\nCloses #55146 from ueshin/issues/SPARK-56327/test_pandas_grouped_map.\n\nAuthored-by: Takuya Ueshin \u003cueshin@databricks.com\u003e\nSigned-off-by: Takuya Ueshin \u003cueshin@databricks.com\u003e\n"
    },
    {
      "commit": "922de744ea23467a9ec287fa4e5e349ea7437e07",
      "tree": "5217f769b878d95b91105588b171cb31590eba0f",
      "parents": [
        "671c65fd3d2f357e8e0d50d6298c4a70ac996ab7"
      ],
      "author": {
        "name": "Liang-Chi Hsieh",
        "email": "viirya@gmail.com",
        "time": "Thu Apr 02 11:01:57 2026 -0700"
      },
      "committer": {
        "name": "Liang-Chi Hsieh",
        "email": "viirya@gmail.com",
        "time": "Thu Apr 02 11:01:57 2026 -0700"
      },
      "message": "[SPARK-56323][SQL] Propagate ROW FORMAT / STORED AS to v2 catalog in CREATE TABLE LIKE\n\n### What changes were proposed in this pull request?\n\nDataSourceV2Strategy was discarding CreateTableLike.serdeInfo (the parsed ROW FORMAT / STORED AS clauses) with a wildcard pattern. CreateTableLikeExec never received them, so a STORED AS or ROW FORMAT override on a V2 catalog target was silently dropped.\n\nFix by:\n- Exposing CatalogV2Util.convertToProperties so it can be called from the core module (same conversion used by regular CREATE TABLE).\n- Adding serdeInfo: Option[SerdeInfo] to CreateTableLikeExec.\n- Passing serdeInfo through DataSourceV2Strategy instead of ignoring it.\n- Including CatalogV2Util.convertToProperties(serdeInfo) in targetProperties so hive.stored-as / hive.input-format / hive.output-format / hive.serde are set in the TableInfo passed to createTableLike.\n\nAdd two tests to CreateTableLikeSuite covering STORED AS and STORED AS INPUTFORMAT/OUTPUTFORMAT for a V2 catalog target.\n\n### Why are the changes needed?\n\nTo fix the issue of not propagating serde info to V2 CREATE TABLE LIKE.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo\n\n### How was this patch tested?\n\nUnit tests\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Sonnet 4.6\n\nCloses #55142 from viirya/create-table-like-serde-fix.\n\nAuthored-by: Liang-Chi Hsieh \u003cviirya@gmail.com\u003e\nSigned-off-by: Liang-Chi Hsieh \u003cviirya@gmail.com\u003e\n"
    },
    {
      "commit": "671c65fd3d2f357e8e0d50d6298c4a70ac996ab7",
      "tree": "5daea33eb6b4478d8b2fc8ccea07bf95834f134c",
      "parents": [
        "73a272d013a41130e83296e3cc9a8bea89c94306"
      ],
      "author": {
        "name": "Peter Toth",
        "email": "peter.toth@gmail.com",
        "time": "Thu Apr 02 19:02:49 2026 +0200"
      },
      "committer": {
        "name": "Peter Toth",
        "email": "peter.toth@gmail.com",
        "time": "Thu Apr 02 19:02:49 2026 +0200"
      },
      "message": "[SPARK-56321][SQL] Fix `AnalysisException` when scan reports transform-based ordering via `SupportsReportOrdering`\n\n### What changes were proposed in this pull request?\n\n`V2ScanPartitioningAndOrdering.ordering` was calling `V2ExpressionUtils.toCatalystOrdering` without the `funCatalog` argument. This meant that function-based sort expressions reported by a data source via `SupportsReportOrdering` (e.g. `bucket(n, col)`) could not be resolved against the function catalog and caused an `AnalysisException` during query planning.\n\nThe fix passes `relation.funCatalog` as the third argument, consistent with how `toCatalystOpt` is already called in the `partitioning` rule of the same object.\n\n`InMemoryBaseTable` is also updated to use a new `InMemoryBatchScanWithOrdering` inner class (implementing `SupportsReportOrdering`) when the table is created with a non-empty ordering. This makes the test infrastructure correctly exercise the `SupportsReportOrdering` code path.\n\n### Why are the changes needed?\n\nWithout the function catalog, sort orders involving catalog functions reported by `SupportsReportOrdering` cannot be resolved, causing an `AnalysisException` during query planning instead of correctly recognizing the reported ordering.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes. Data sources implementing `SupportsReportOrdering` with function-based sort expressions (e.g. `bucket`) that require the function catalog will now have those sort orders correctly recognized by Spark instead of throwing an `AnalysisException`.\n\n### How was this patch tested?\n\nA new test `SPARK-56321: scan with SupportsReportOrdering and function-based sort order` is added to `WriteDistributionAndOrderingSuite`. It creates a table with `bucket(4, \"id\")` partitioning and ordering, queries it, and asserts that the scan reports a non-empty `outputOrdering` with a `TransformExpression`, verifying the bucket transform was resolved correctly via the function catalog.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Sonnet 4.6\n\nCloses #55137 from peter-toth/SPARK-56321-use-function-catalog-in-v2scanpartitioningandordering.\n\nAuthored-by: Peter Toth \u003cpeter.toth@gmail.com\u003e\nSigned-off-by: Peter Toth \u003cpeter.toth@gmail.com\u003e\n"
    },
    {
      "commit": "73a272d013a41130e83296e3cc9a8bea89c94306",
      "tree": "dd308acc5d58e31c0c88448da419b75d3351329f",
      "parents": [
        "b580b4f437194806b73169848850f283d9d631a5"
      ],
      "author": {
        "name": "Yicong-Huang",
        "email": "17627829+Yicong-Huang@users.noreply.github.com",
        "time": "Thu Apr 02 19:57:56 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Thu Apr 02 19:57:56 2026 +0800"
      },
      "message": "[SPARK-56189][PYTHON] Refactor SQL_WINDOW_AGG_ARROW_UDF\n\n### What changes were proposed in this pull request?\n\nRefactor `SQL_WINDOW_AGG_ARROW_UDF` to be self-contained in `read_udfs()`, moving bounded/unbounded window logic from wrapper functions and the old mapper into a single execution block that uses `ArrowStreamGroupSerializer` as pure I/O.\n\nThis is a re-submission of #55123 which was reverted due to CI failure caused by using a non-existent `num_dfs` parameter on `ArrowStreamSerializer`. This version uses `ArrowStreamGroupSerializer` instead.\n\n### Why are the changes needed?\n\nPart of [SPARK-55388](https://issues.apache.org/jira/browse/SPARK-55388).\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nExisting tests.\n\nASV micro-benchmarks with `repeat\u003d(3, 5)` show no regression:\n\n**SQL_WINDOW_AGG_ARROW_UDF (time)**\n\n| Scenario | UDF | Before | After | Change |\n|---|---|---|---|---|\n| few_groups_sm | sum | 8.73±0.06ms | 8.70±0.09ms | ~neutral |\n| few_groups_sm | mean_multi | 7.70±0.2ms | 7.82±0.1ms | ~neutral |\n| few_groups_lg | sum | 31.6±0.2ms | 31.6±0.5ms | ~neutral |\n| few_groups_lg | mean_multi | 29.6±0.2ms | 29.6±0.4ms | ~neutral |\n| many_groups_sm | sum | 244±4ms | 241±4ms | ~neutral |\n| many_groups_sm | mean_multi | 208±4ms | 206±2ms | ~neutral |\n| many_groups_lg | sum | 135±0.4ms | 134±0.3ms | ~neutral |\n| many_groups_lg | mean_multi | 124±2ms | 120±0.6ms | -3% |\n| wide_cols | sum | 72.7±0.1ms | 69.4±2ms | -5% |\n| wide_cols | mean_multi | 70.0±0.9ms | 69.9±0.3ms | ~neutral |\n\n**Peak memory**: No change (468M-506M for all scenarios).\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55153 from Yicong-Huang/refactor/window-agg-arrow-udf.\n\nAuthored-by: Yicong-Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\n"
    },
    {
      "commit": "b580b4f437194806b73169848850f283d9d631a5",
      "tree": "123f8f3e751dff43a5537b0813ef5d0f4312cc9b",
      "parents": [
        "3a226201f6cd6d7c1ccbced4690ad99e74229489"
      ],
      "author": {
        "name": "Tengfei Huang",
        "email": "tengfei.huang@databricks.com",
        "time": "Thu Apr 02 15:22:22 2026 +0800"
      },
      "committer": {
        "name": "Cheng Pan",
        "email": "chengpan@apache.org",
        "time": "Thu Apr 02 15:22:22 2026 +0800"
      },
      "message": "[SPARK-56251][SQL] Add default fetchSize for postgres to avoid loading all data in memory\n\n### What changes were proposed in this pull request?\nThis PR adds a default `fetchSize` of 1000 for the PostgreSQL JDBC dialect to prevent loading entire tables into memory when no explicit `fetchSize` is specified by the user.\n\nThe changes include:\n\n1. **`JdbcDialect`**: Added `getFetchSize(options)` method that returns `options.fetchSize` by default. Dialects can override this to provide a sensible default when the user does not explicitly set the `fetchsize` option.\n2. **`PostgresDialect`**: Overrides `getFetchSize` to return 1000 when the user hasn\u0027t set `fetchsize`, and updates `beforeFetch` to use `getFetchSize()` for the `autoCommit` decision. Also logs an info message when the default is applied.\n3. **`AggregatedDialect`**: Delegates `getFetchSize` and `beforeFetch` to `dialects.head`, consistent with how other methods (e.g., `quoteIdentifier`, `getTruncateQuery`) are delegated.\n4. **`JDBCRDD`**: Uses `dialect.getFetchSize(options)` instead of `options.fetchSize` for `stmt.setFetchSize()`.\n\n### Why are the changes needed?\nBy default, the PostgreSQL JDBC driver loads **all rows** into memory when `fetchSize` is 0 (the Spark default). Without partitioning information, a single task may load the entire table into memory, which can easily cause executor OOM.\n\nUnlike most JDBC drivers, PostgreSQL requires both `fetchSize \u003e 0` **and** `autoCommit \u003d false` to enable cursor-based fetching (see [PostgreSQL JDBC documentation](https://jdbc.postgresql.org/documentation/head/query.html#query-with-cursor)). Setting a sensible default fetch size of 1000 enables cursor-based row batching automatically, preventing the driver from buffering the entire result set.\n\nUsers can still override the default by explicitly setting the `fetchsize` option (including `fetchsize\u003d0` to restore the old behavior).\n\n### Does this PR introduce _any_ user-facing change?\nNA\n\n### How was this patch tested?\nUTs added.\nManually verified the behavior for Postgres.\n\n### Was this patch authored or co-authored using generative AI tooling?\nGenerated-by: Claude Code v2.1.87\n\nCloses #55053 from ivoson/postgres-default-fetchsize.\n\nLead-authored-by: Tengfei Huang \u003ctengfei.huang@databricks.com\u003e\nCo-authored-by: Tengfei Huang \u003ctengfei.h@gmail.com\u003e\nSigned-off-by: Cheng Pan \u003cchengpan@apache.org\u003e\n"
    },
    {
      "commit": "3a226201f6cd6d7c1ccbced4690ad99e74229489",
      "tree": "c9e38f72d28ebad819853a42729e1069862a1123",
      "parents": [
        "0d8e031b82fa41a2a9306630c300dc1a29564fc7"
      ],
      "author": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Thu Apr 02 11:44:09 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Thu Apr 02 11:44:09 2026 +0800"
      },
      "message": "[SPARK-56313][PYTHON][FOLLOWUP] Use old way to label generic for a class\n\n### What changes were proposed in this pull request?\n\nUse the old way to label generic for a class.\n\n### Why are the changes needed?\n\nThe new way is only supported starting 3.12.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nLocal mypy passed. Waiting for CI.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55149 from gaogaotiantian/fix-rddsampler-generic-again.\n\nAuthored-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\n"
    },
    {
      "commit": "0d8e031b82fa41a2a9306630c300dc1a29564fc7",
      "tree": "81a83f5914ee19364c1e89361b780a7e9817ed4a",
      "parents": [
        "0af9722529ea3788f0a323f8dab3704fbee776eb"
      ],
      "author": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Thu Apr 02 11:42:15 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Thu Apr 02 11:42:15 2026 +0800"
      },
      "message": "Revert \"[SPARK-56189][PYTHON] Refactor SQL_WINDOW_AGG_ARROW_UDF\"\n\nThis reverts commit 3433c386089ccc9c9e2b24a3faf0dcd8f0984b30.\n\nto restore CI https://github.com/apache/spark/actions/runs/23879531895/job/69630001459\n\nCloses #55152 from zhengruifeng/revert-56189.\n\nAuthored-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\n"
    },
    {
      "commit": "0af9722529ea3788f0a323f8dab3704fbee776eb",
      "tree": "020f154d8881a9bca6b93f9f425fea073de208cd",
      "parents": [
        "25e85b2e20180a4adb24be5f292dc51421ea3823"
      ],
      "author": {
        "name": "Yicong Huang",
        "email": "17627829+Yicong-Huang@users.noreply.github.com",
        "time": "Thu Apr 02 09:53:29 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Thu Apr 02 09:53:29 2026 +0800"
      },
      "message": "[SPARK-56260][INFRA] Pin third-party GitHub Actions to commit SHA\n\n### What changes were proposed in this pull request?\n\nPin all third-party GitHub Actions in CI workflows to specific commit SHA versions instead of mutable version tags. A version comment is added alongside each SHA for maintainability.\n\nActions pinned:\n\n| Action | Before | Pinned SHA | Apache Allowlist |\n|---|---|---|---|\n| `bufbuild/buf-setup-action` | `v1` | `a47c93e0...` | `*: keep: true` |\n| `bufbuild/buf-lint-action` | `v1` | `06f9dd82...` | `*: keep: true` |\n| `bufbuild/buf-breaking-action` | `v1` | `c57b3d84...` | `*: keep: true` |\n| `codecov/codecov-action` | `v5` | `75cd1169...` | `*: keep: true` |\n| `medyagh/setup-minikube` | `v0.0.21` | `e9e035a8...` | `*: keep: true` |\n| `r-lib/actions/setup-r` | `v2` | `6f6e5bc6...` | `*: keep: true` |\n| `ruby/setup-ruby` | `v1` | `4dc28cf1...` | `*: keep: true` |\n| `test-summary/action` | `v2` | `31493c76...` | `*: keep: true` |\n\nAll actions are verified to be on the [Apache infrastructure-actions allowlist](https://github.com/apache/infrastructure-actions/blob/main/actions.yml).\n\n### Why are the changes needed?\n\nPinning actions to commit SHAs is a [security best practice recommended by GitHub](https://docs.github.com/en/actions/security-for-github-actions/security-guides/security-hardening-for-github-actions#using-third-party-actions) to prevent supply chain attacks via compromised action tags.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo\n\n### How was this patch tested?\n\nNo functional changes - only CI workflow action references are updated.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo\n\nCloses #55066 from Yicong-Huang/SPARK-56260-third-party.\n\nAuthored-by: Yicong Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "25e85b2e20180a4adb24be5f292dc51421ea3823",
      "tree": "cb2efdc48ea64ed8afdc506a97d2ec4c0df94061",
      "parents": [
        "3433c386089ccc9c9e2b24a3faf0dcd8f0984b30"
      ],
      "author": {
        "name": "Yicong Huang",
        "email": "17627829+Yicong-Huang@users.noreply.github.com",
        "time": "Thu Apr 02 09:43:30 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Thu Apr 02 09:43:30 2026 +0800"
      },
      "message": "[SPARK-56222][PYTHON] Create ArrowStreamGroupSerializer and ArrowStreamCoGroupSerializer\n\n### What changes were proposed in this pull request?\n\nRefactors `ArrowStreamSerializer` by extracting group and cogroup loading logic into dedicated subclasses:\n\n```\nArrowStreamSerializer              (plain Arrow stream I/O)\n  ├── ArrowStreamGroupSerializer    (grouped loading, 1 df/group)\n  ├── ArrowStreamCoGroupSerializer  (cogrouped loading, 2 dfs/group)\n  ├── ArrowStreamUDFSerializer\n  ├── ArrowStreamPandasSerializer\n  └── ArrowStreamArrowUDFSerializer\n```\n\nKey changes:\n- `ArrowStreamSerializer`: simplified to only handle plain Arrow stream read/write. Removed `num_dfs` parameter.\n- `ArrowStreamGroupSerializer(ArrowStreamSerializer)`: new class that overrides `load_stream` with group-count protocol for single-dataframe groups.\n- `ArrowStreamCoGroupSerializer(ArrowStreamSerializer)`: new class that overrides `load_stream` with group-count protocol for two-dataframe cogroups.\n\n### Why are the changes needed?\n\nThis is part of the ongoing serializer simplification effort (SPARK-55384).  The previous `ArrowStreamSerializer` mixed plain stream I/O with group-count protocol logic via a `num_dfs` parameter and a multi-purpose `_load_group_dataframes` method. This made the return types ambiguous — callers couldn\u0027t tell from the type signature whether they\u0027d get `pa.RecordBatch`, `Iterator[pa.RecordBatch]`, or `Tuple[...]`. 
By splitting group and cogroup into separate classes, each `load_stream` has a clear, precise return type, improving readability and enabling better static analysis.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nExisting tests.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55026 from Yicong-Huang/SPARK-56222.\n\nLead-authored-by: Yicong Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\nCo-authored-by: Yicong-Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\n"
    },
    {
      "commit": "3433c386089ccc9c9e2b24a3faf0dcd8f0984b30",
      "tree": "fea40659d8e836ebd83c554829e92ef5c40b3b39",
      "parents": [
        "d498b067f33ab4973d1c9a382fe6b31307ad3610"
      ],
      "author": {
        "name": "Yicong Huang",
        "email": "17627829+Yicong-Huang@users.noreply.github.com",
        "time": "Thu Apr 02 09:41:13 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Thu Apr 02 09:41:13 2026 +0800"
      },
      "message": "[SPARK-56189][PYTHON] Refactor SQL_WINDOW_AGG_ARROW_UDF\n\n### What changes were proposed in this pull request?\n\nRefactor `SQL_WINDOW_AGG_ARROW_UDF` to be self-contained in `read_udfs()`, moving bounded/unbounded window logic from wrapper functions and the old mapper into a single execution block that uses `ArrowStreamSerializer` as pure I/O.\n\n### Why are the changes needed?\n\nPart of [SPARK-55388](https://issues.apache.org/jira/browse/SPARK-55388).\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nExisting tests\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55123 from Yicong-Huang/refactor/window-agg-arrow-udf.\n\nAuthored-by: Yicong Huang \u003c17627829+Yicong-Huang@users.noreply.github.com\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\n"
    },
    {
      "commit": "d498b067f33ab4973d1c9a382fe6b31307ad3610",
      "tree": "6315684c28aa53e54f1fb7eafe2b4daf9e1cf709",
      "parents": [
        "989df5e20d0dc1da749efe5e4f9a9e00b574e521"
      ],
      "author": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Thu Apr 02 09:36:17 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Thu Apr 02 09:36:17 2026 +0800"
      },
      "message": "[SPARK-56314][SQL][TESTS] Avoid uncessary RDD-\u003eDataFrame conversion in `SQLTestData`\n\n### What changes were proposed in this pull request?\nAvoid uncessary RDD-\u003eDataFrame conversion in `SQLTestData`\n\n### Why are the changes needed?\n1, several datasets are RDDs, but they need to be converted to DataFrame/Rows in callsites;\n2, Make all datasets are DataFrames, so that they can be reused in Connect Tests in the future.\n\n### Does this PR introduce _any_ user-facing change?\nNo\n\n### How was this patch tested?\nCI\n\n### Was this patch authored or co-authored using generative AI tooling?\nCo-authored-by: Claude Code (Opus 4.6)\n\nCloses #55126 from zhengruifeng/sql_data_rdd.\n\nAuthored-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\n"
    },
    {
      "commit": "989df5e20d0dc1da749efe5e4f9a9e00b574e521",
      "tree": "35e1e23edc614408ecf72e890754914cf34bf334",
      "parents": [
        "00b9451f480934375c5cadc723d1a0b11da6a34b"
      ],
      "author": {
        "name": "Akash Nayar",
        "email": "akashknayar5@gmail.com",
        "time": "Thu Apr 02 09:27:26 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Thu Apr 02 09:27:26 2026 +0800"
      },
      "message": "[SPARK-56306][SQL] Fix collation-aware PIVOT\n\n### What changes were proposed in this pull request?\n\nThis PR fixes PivotFirst to respect pivot value collation. Below is a minimalistic repro for the bug:\n\n```\nfrom pyspark.sql import SparkSession\n\nspark \u003d SparkSession.builder.appName(\"PivotCollationTest\").getOrCreate()\n\ndata \u003d [(1, \"SALES\", 100), (1, \"sales\", 50)]\ndf \u003d spark.createDataFrame(data, [\"emp_id\", \"dept\", \"amount\"])\ndf.createOrReplaceTempView(\"df\")\n\nCOLLATED \u003d \"(SELECT emp_id, COLLATE(dept, \u0027UTF8_LCASE\u0027) AS dept, amount FROM df)\"\n\nprint(\"Pivot with \u0027SALES\u0027 - expects 150, gets 150\")\nspark.sql(f\"\"\"\n    SELECT * FROM {COLLATED}\n    PIVOT (SUM(amount) FOR dept IN (\u0027SALES\u0027))\n\"\"\").show()\n\nprint(\"Pivot with \u0027sales\u0027 - expects 150, gets NULL\")\nspark.sql(f\"\"\"\n    SELECT * FROM {COLLATED}\n    PIVOT (SUM(amount) FOR dept IN (\u0027sales\u0027))\n\"\"\").show()\n\nspark.stop()\n```\n\nGiven the `UTF8_LCASE` collation, both PIVOT queries should return a `SUM(amount)` of 150. However, since Scala\u0027s ImmutableHashMap does a byte-level comparison of the keys in `pivotIndex` with incoming `pivotColumn` values, `SUM(amount)` for `\u0027SALES\u0027` returns `150` and `NULL` for `sales`. This is because the HashMap does a byte-by-byte comparison between `\u0027sales\u0027` and the resolved group-key (`\u0027SALES\u0027`).\n\nThe fix is to fall back to `TreeMap` (already used on nested types) for non-binary-stable string types (e.g., `UTF8_LCASE`, `UNICODE_CI`).\n\n### Why are the changes needed?\n\nThere is currently a correctness bug. `PIVOT` compares atomic strings at the byte level, completely ignoring specified collation and silently returning `NULL`.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes, collation-aware pivot values will now adhere to the specified collation. 
This will result in fewer false negative pivot column value matches and more meaningful results.\n\n### How was this patch tested?\n\nUnit tests testing `UTF8_LCASE` and `UNICODE_CI` collation with nulls.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: claude-4.6-opus-high\n\nCloses #55109 from aknayar/collation-aware-pivot.\n\nAuthored-by: Akash Nayar \u003cakashknayar5@gmail.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "00b9451f480934375c5cadc723d1a0b11da6a34b",
      "tree": "be40074a58611049b165a4a1ae9dfae93dab2ec9",
      "parents": [
        "463a188f804f17177d62f46954598a5b2f472be4"
      ],
      "author": {
        "name": "Szehon Ho",
        "email": "szehon.apache@gmail.com",
        "time": "Thu Apr 02 09:24:32 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Thu Apr 02 09:24:32 2026 +0800"
      },
      "message": "[SPARK-55981][SQL] Allow Geo Types with SRID\u0027s from the pre-built registry\n\n### What changes were proposed in this pull request?\nThis is based on PR: https://github.com/apache/spark/pull/54571, which pre-compiles a list of SRID\u0027s from Proj.\n\nOnce that is in, we can activate these SRID\u0027s in the Geometry and Geography types\n\nWe also have some overrides to support OGC standard as well.\n\nNote: some of these changes were from uros-db initially in https://github.com/apache/spark/pull/54543, thanks!\n\n### Why are the changes needed?\nSupport standard SRID\u0027s for new Geo types\n\n### Does this PR introduce _any_ user-facing change?\nGeo types not released yet\n\n### How was this patch tested?\nAdd unit tests\n\n### Was this patch authored or co-authored using generative AI tooling?\nYes\n\nCloses #54780 from szehon-ho/srid.\n\nAuthored-by: Szehon Ho \u003cszehon.apache@gmail.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "463a188f804f17177d62f46954598a5b2f472be4",
      "tree": "e2fcdf94e94b9a10414e022eb3191956558fcf6c",
      "parents": [
        "e0bc7aa7d93d8542db43e7a52478196d895ba0fc"
      ],
      "author": {
        "name": "Szehon Ho",
        "email": "szehon.apache@gmail.com",
        "time": "Thu Apr 02 09:22:33 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Thu Apr 02 09:22:33 2026 +0800"
      },
      "message": "[SPARK-56190][SQL] Support nested partition columns for DSV2 PartitionPredicate\n\n### What changes were proposed in this pull request?\nSupported nested columns.\n1. Pass an enhanced partitionSchema to the pushdownFilters()\n2. Add support to \u0027flatten\u0027 the provided filters (GetStruct(AttributeReference(\"parent\"), \"child\")) into a schema understood by the partition predicate and pushdown.\n\n### Why are the changes needed?\nDSV2 connectors support nested struct fields as partition fields.\n\n### Does this PR introduce _any_ user-facing change?\nNo\n\n### How was this patch tested?\nAdd unit tests to DSV2EnhancedPartitionFilterSuite.\n\n### Was this patch authored or co-authored using generative AI tooling?\nYes, cursor with hand refactoring\n\nCloses #54995 from szehon-ho/nested_partition_filter.\n\nAuthored-by: Szehon Ho \u003cszehon.apache@gmail.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "e0bc7aa7d93d8542db43e7a52478196d895ba0fc",
      "tree": "3fadf77ee32bd5d0a2bcb4aac068ef22cc1f77ed",
      "parents": [
        "e75d6fbbae46a77a2876f81e3704a7164de02c52"
      ],
      "author": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Wed Apr 01 15:38:38 2026 -0700"
      },
      "committer": {
        "name": "Takuya Ueshin",
        "email": "ueshin@databricks.com",
        "time": "Wed Apr 01 15:38:38 2026 -0700"
      },
      "message": "[SPARK-56313][PYTHON][FOLLOWUP] Remove the generic for rddsampler methods\n\n### What changes were proposed in this pull request?\n\nRemove the generic (`[T]`) for rddsampler methods\n\n### Why are the changes needed?\n\nIt\u0027s supported starting from 3.12 but we support python since 3.10.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nmypy passed locally.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55144 from gaogaotiantian/fix-rddsampler-generic.\n\nAuthored-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\nSigned-off-by: Takuya Ueshin \u003cueshin@databricks.com\u003e\n"
    },
    {
      "commit": "e75d6fbbae46a77a2876f81e3704a7164de02c52",
      "tree": "c28dda0dc728b223884143b42ba0a848b46dd386",
      "parents": [
        "fec28040907b511f00a60da56a02261a7efed0ef"
      ],
      "author": {
        "name": "Takuya Ueshin",
        "email": "ueshin@databricks.com",
        "time": "Wed Apr 01 11:15:39 2026 -0700"
      },
      "committer": {
        "name": "Takuya Ueshin",
        "email": "ueshin@databricks.com",
        "time": "Wed Apr 01 11:15:39 2026 -0700"
      },
      "message": "[SPARK-56219][PS][FOLLOW-UP] Keep legacy groupby idxmax and idxmin skipna\u003dFalse behavior for pandas 2\n\n### What changes were proposed in this pull request?\n\nThis is a follow-up of apache/spark#55021.\n\nThis PR updates pandas-on-Spark `GroupBy.idxmax` and `GroupBy.idxmin` for `skipna\u003dFalse` to keep the legacy behavior for all pandas 2 versions.\n\nWith this change:\n\n- pandas `\u003c 3.0.0` keeps the legacy `idxmax` and `idxmin` result for `skipna\u003dFalse`\n- pandas `\u003e\u003d 3.0.0` keeps the existing error behavior for NA-containing input\n\nThis PR also updates the related test in `python/pyspark/pandas/tests/groupby/test_index.py` to validate the pandas 2 behavior directly instead of relying on pandas 2.2 and 2.3 having the same result.\n\n### Why are the changes needed?\n\nThe previous fix split pandas 2.2 and pandas 2.3 behavior for `GroupBy.idxmax(skipna\u003dFalse)` and `GroupBy.idxmin(skipna\u003dFalse)` on NA-containing input.\n\nFor example:\n\n```python\npdf \u003d pd.DataFrame({\"a\": [1, 1, 2, 2], \"b\": [1, None, 3, 4], \"c\": [4, 3, 2, 1]})\npdf.groupby([\"a\"]).idxmax(skipna\u003dFalse).sort_index()\n```\n\nIn pandas 2.2, this returns:\n\n```python\n   b  c\na\n1  0  0\n2  3  2\n```\n\nIn pandas 2.3, this returns:\n\n```python\n     b  c\na\n1  NaN  0\n2  3.0  2\n```\n\nIn pandas 3, this raises `ValueError`.\n\nInstead of matching the pandas 2.2 / 2.3 difference, this PR keeps the legacy pandas 2 behavior across all pandas 2 environments and continues to follow the pandas 3 behavior in pandas 3 environments.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes.\n\nIn pandas-on-Spark with pandas 2.x, `GroupBy.idxmax(skipna\u003dFalse)` and `GroupBy.idxmin(skipna\u003dFalse)` on NA-containing groups now consistently keep the legacy result behavior instead of varying with the installed pandas 2 version.\n\nFor pandas 3, behavior is unchanged from the current implementation.\n\n### How was this patch 
tested?\n\nRan the related pandas-on-Spark regression test in three environments:\n\n- pandas 2.2: `GroupbyIndexTests.test_idxmax_idxmin_skipna_false_with_na`\n- pandas 2.3: `GroupbyIndexTests.test_idxmax_idxmin_skipna_false_with_na`\n- pandas 3.0: `GroupbyIndexTests.test_idxmax_idxmin_skipna_false_with_na`\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Codex (GPT-5)\n\nCloses #55121 from ueshin/issues/SPARK-56219/pd2.2.\n\nAuthored-by: Takuya Ueshin \u003cueshin@databricks.com\u003e\nSigned-off-by: Takuya Ueshin \u003cueshin@databricks.com\u003e\n"
    },
    {
      "commit": "fec28040907b511f00a60da56a02261a7efed0ef",
      "tree": "5585c59a8f8d7b4547a225e8ee59e391dd0885bc",
      "parents": [
        "0f0c0e25da8789ee1ecac5c1662e48bb0f0ecceb"
      ],
      "author": {
        "name": "Takuya Ueshin",
        "email": "ueshin@databricks.com",
        "time": "Wed Apr 01 11:13:39 2026 -0700"
      },
      "committer": {
        "name": "Takuya Ueshin",
        "email": "ueshin@databricks.com",
        "time": "Wed Apr 01 11:13:39 2026 -0700"
      },
      "message": "[SPARK-56310][PYTHON] Handle pandas 3 dtype in DataFrame.toPandas\n\n### What changes were proposed in this pull request?\n\nThis PR updates PySpark `DataFrame.toPandas()` dtype correction for pandas 3.x.\n\nIn `python/pyspark/sql/pandas/types.py`, `StringType` is mapped to `pd.StringDtype(na_value\u003dnp.nan)` when running with pandas 3.x instead of leaving the column as `object`. The `TimestampType` conversion path is also adjusted so that after timezone normalization the series is cast back to the expected pandas dtype only for pandas 3.x.\n\nThe related assertions in `python/pyspark/sql/tests/test_collection.py` are updated to check pandas-version-specific dtypes for string, datetime, and timedelta columns, and the Arrow on/off loops now use `subTest(...)` for clearer failures.\n\nSince the pandas 3 string dtype changes also affect downstream restoration behavior, `python/pyspark/pandas/data_type_ops/string_ops.py` now restores missing string values as `None` before casting back to a non-string dtype. The Spark Connect coverage in `python/pyspark/sql/tests/connect/test_connect_dataframe_property.py` is also updated to reflect the pandas 3 string dtype expectation.\n\n### Why are the changes needed?\n\npandas 3 changes dtype behavior for strings and datetime-related values compared to earlier pandas versions. 
The existing `toPandas()` logic and related tests still assume `object` string columns and older datetime/timedelta dtype expectations in places where pandas 3 now returns string extension dtypes and microsecond-resolution timestamp/timedelta dtypes.\n\nWithout these changes, `DataFrame.toPandas()` does not preserve pandas 3 string dtype behavior correctly, and some tests and pandas-on-Spark restoration paths still assume the pre-pandas-3 representation.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes.\n\nWith pandas 3.x, `DataFrame.toPandas()` can now return Spark string columns as pandas `StringDtype(na_value\u003dnp.nan)` instead of `object`, and timestamp/timedelta columns follow the pandas 3 dtype expectations more consistently after conversion.\n\nThis is a user-facing behavior change compared to released versions that still return the older dtype behavior.\n\n### How was this patch tested?\n\nUpdated the related tests.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55118 from ueshin/issues/SPARK-56310/dtypes.\n\nAuthored-by: Takuya Ueshin \u003cueshin@databricks.com\u003e\nSigned-off-by: Takuya Ueshin \u003cueshin@databricks.com\u003e\n"
    },
    {
      "commit": "0f0c0e25da8789ee1ecac5c1662e48bb0f0ecceb",
      "tree": "9160a9b855b065c6b0f9eb787af6d2eb4149e74d",
      "parents": [
        "3e7e12652b281ac091f89b6ce3e82354cf9cd78f"
      ],
      "author": {
        "name": "Liang-Chi Hsieh",
        "email": "viirya@gmail.com",
        "time": "Wed Apr 01 10:41:53 2026 -0700"
      },
      "committer": {
        "name": "Liang-Chi Hsieh",
        "email": "viirya@gmail.com",
        "time": "Wed Apr 01 10:41:53 2026 -0700"
      },
      "message": "[SPARK-56296][SQL] Pivot createTableLike to pass full TableInfo including schema, partitioning, constraints, and owner\n\n### What changes were proposed in this pull request?\n\nPreviously createTableLike(ident, sourceTable, userSpecifiedOverrides) only passed user-specified TBLPROPERTIES to the connector via TableInfo, requiring connectors to call CurrentUserContext.getCurrentUser (a Catalyst internal) to set the owner.\n\nWe should not expose Catalyst internal to connectors. But putting owner to the TableInfo means that we will have to rename userProvidedOverrides to something else, meaning it no longer contains only user overrides.\n\nThis change pivots to createTableLike(ident, tableInfo, sourceTable) where tableInfo contains all explicit information for the new table:\n- columns and partitioning copied from the source\n- constraints copied from the source\n- user-specified TBLPROPERTIES, LOCATION, and USING provider (if given)\n- PROP_OWNER set to the current user\n\nSource table properties are intentionally excluded from tableInfo; connectors receive sourceTable to clone any format-specific or custom state they need. This matches the pattern used by REPLACE TABLE in DSv2.\n\nUpdate InMemoryTableCatalog, CatalogSuite, and CreateTableLikeSuite accordingly.\n\n### Why are the changes needed?\n\nThis pivot is necessary to keep a balance between API consistency and internal exposure.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo\n\n### How was this patch tested?\n\nUnit tests\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Sonnet 4.6\n\nCloses #55101 from viirya/create-table-like-prop-owner-fix.\n\nAuthored-by: Liang-Chi Hsieh \u003cviirya@gmail.com\u003e\nSigned-off-by: Liang-Chi Hsieh \u003cviirya@gmail.com\u003e\n"
    },
    {
      "commit": "3e7e12652b281ac091f89b6ce3e82354cf9cd78f",
      "tree": "a8f61154dda5189ff458bff6d4982894373e321f",
      "parents": [
        "dd492dd85b2379e1cbb5e5161763da1621a062a2"
      ],
      "author": {
        "name": "Thang Long VU",
        "email": "long.vu@databricks.com",
        "time": "Wed Apr 01 20:18:15 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Wed Apr 01 20:18:15 2026 +0800"
      },
      "message": "[SPARK-56001][SQL] Add INSERT INTO ... REPLACE ON/USING syntax\n\n### What changes were proposed in this pull request?\n\nThis PR introduces two new SQL syntaxes for the `INSERT` command (think `JOIN ON/USING`and `INSERT REPLACE WHERE`):\n- `INSERT INTO ... REPLACE ON \u003ccondition\u003e` — replaces rows matching a condition\n- `INSERT INTO ... REPLACE USING (\u003ccolumns\u003e)` — replaces rows based on matching column values\n\nSimilar to the [INSERT WITH SCHEMA EVOLUTION PR](https://github.com/apache/spark/pull/53732), Spark is only responsible for recognizing these syntaxes. Since no table format in open-source Spark implements these operations yet, users will receive an unsupported error if they try to use them.\n\n### Why are the changes needed?\n\n`INSERT INTO ... REPLACE ON/USING` provides SQL syntax for atomically replacing a subset of rows in a table. This builds on the existing `INSERT INTO ... REPLACE WHERE` syntax ([SPARK-40956](https://issues.apache.org/jira/browse/SPARK-40956) and extends it with more flexible matching semantics:\n- `REPLACE ON` allows matching via arbitrary boolean expressions (e.g., `t.id \u003d s.id`)\n- `REPLACE USING` allows matching via a list of column names\n\n### Does this PR introduce _any_ user-facing change?\n\nYes. 
Two new SQL syntaxes are recognized by the parser:\n- `INSERT [WITH SCHEMA EVOLUTION] INTO table AS alias [BY NAME] REPLACE ON condition query`\n- `INSERT [WITH SCHEMA EVOLUTION] INTO table [BY NAME] REPLACE USING (column_list) query`\n\nBoth currently throw `UNSUPPORTED_INSERT_REPLACE_ON_OR_USING`.\n\n### How was this patch tested?\n\n- DDLParserSuite: Parser tests for REPLACE USING, REPLACE ON, and combined WITH SCHEMA EVOLUTION\n- PlanResolutionSuite: V2 table unsupported error tests\n- InsertSuite (core): V1 table unsupported error tests\n- InsertSuite (hive): Hive table unsupported error tests\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nYes.\n\nCloses #54722 from longvu-db/insert-replace-on-using.\n\nAuthored-by: Thang Long VU \u003clong.vu@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "dd492dd85b2379e1cbb5e5161763da1621a062a2",
      "tree": "cd1c4c6be656e4c4ce4dc81d77e640b183ee2cbb",
      "parents": [
        "9518075a01ff264226cefbfe19ff7239c9ecaf33"
      ],
      "author": {
        "name": "Johan Lasperas",
        "email": "johan.lasperas@databricks.com",
        "time": "Wed Apr 01 18:28:04 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Wed Apr 01 18:28:04 2026 +0800"
      },
      "message": "[SPARK-56054][SQL] Fix aliased nested fields ignored for schema evolution in MERGE\n\n### What changes were proposed in this pull request?\nFixes a small bug in the [initial implementation of schema evolution in MERGE](https://github.com/apache/spark/pull/51698): when a nested struct fields present in the source is used aliased in a direct assignment clause in MERGE, it is not correctly considered for schema evolution.\n\nExample:\n```\nsource.mergeInto(\"target\", condition)\n  .whenMatched()\n  update(Map(\"info\" -\u003e col(\"source.info\").as(\"info\")))\n  .withSchemaEvolution()\n  .merge()\n```\nwhere `info` is a struct that contains an extra field in the source compared to the target. Without this fix, the extra field is ignored during schema evolution and isn\u0027t added to the target table.\nWith the fix, it is correctly added to the target schema.\n\n### How was this patch tested?\nAdded a test that reproduces the issue.\n\nCloses #54891 from johanl-db/dsv2-schema-evolution-merge-alias.\n\nAuthored-by: Johan Lasperas \u003cjohan.lasperas@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "9518075a01ff264226cefbfe19ff7239c9ecaf33",
      "tree": "cac15226850c2417dbe2a14e5db3617e69175374",
      "parents": [
        "d1916e311dfaa83df04690ee22d0a8197b89eda5"
      ],
      "author": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Wed Apr 01 14:35:25 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Wed Apr 01 14:35:25 2026 +0800"
      },
      "message": "[SPARK-54938][PYTHON][TEST][FOLLOW-UP] Fix `test_pyarrow_array_type_inference` for pandas \u003e\u003d 3\n\n### What changes were proposed in this pull request?\npandas 3.x changed default string dtype to use pyarrow-backed storage, causing pa.array() to infer large_string instead of string for string Series. Conditionally expect large_string on pandas \u003e\u003d 3.\n\n### Why are the changes needed?\nto resolve failure in https://github.com/apache/spark/actions/runs/23819581811/job/69428367355\n\n### Does this PR introduce _any_ user-facing change?\nNo, test-only\n\n### How was this patch tested?\nmanually check\n\npandas\u003d\u003d3.0.1\n```\nIn [3]: import pyarrow as pa\n\nIn [4]: import pandas as pd\n\nIn [5]: ser \u003d pd.Series([\"a\", \"b\", \"c\"], dtype\u003dpd.StringDtype())\n\nIn [6]: pa.array(ser)\nOut[6]:\n\u003cpyarrow.lib.LargeStringArray object at 0x103455d80\u003e\n[\n  \"a\",\n  \"b\",\n  \"c\"\n]\n\nIn [7]: pa.array(ser, pa.string())\nOut[7]:\n\u003cpyarrow.lib.StringArray object at 0x1095546a0\u003e\n[\n  \"a\",\n  \"b\",\n  \"c\"\n]\n```\n\npandas\u003d\u003d2.3.3\n```\nIn [7]: ser \u003d pd.Series([\"a\", \"b\", \"c\"], dtype\u003dpd.StringDtype())\n\nIn [8]: pa.array(ser)\nOut[8]:\n\u003cpyarrow.lib.StringArray object at 0x10ae16620\u003e\n[\n  \"a\",\n  \"b\",\n  \"c\"\n]\n\nIn [9]: pa.array(ser, pa.string())\nOut[9]:\n\u003cpyarrow.lib.StringArray object at 0x10ae14a00\u003e\n[\n  \"a\",\n  \"b\",\n  \"c\"\n]\n```\n\n### Was this patch authored or co-authored using generative AI tooling?\nCo-authored-by: Claude code (Opus 4.6)\n\nCloses #55125 from zhengruifeng/fix_inference_p3.\n\nAuthored-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\n"
    },
    {
      "commit": "d1916e311dfaa83df04690ee22d0a8197b89eda5",
      "tree": "a20244c0560f0a09ad33e3973c2079c8cee61b5e",
      "parents": [
        "1dd26f9218491ff538b13f73ec39b0c7920009f7"
      ],
      "author": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Wed Apr 01 11:46:18 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Wed Apr 01 11:46:18 2026 +0800"
      },
      "message": "[SPARK-56313][PYTHON] Add type hint for rddsampler.py\n\n### What changes were proposed in this pull request?\n\nAdd type hints for rddsampler.py\n\n### Why are the changes needed?\n\nPolish the type annotation for pyspark.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nCI.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55122 from gaogaotiantian/rddsampler-type-hint.\n\nAuthored-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\n"
    },
    {
      "commit": "1dd26f9218491ff538b13f73ec39b0c7920009f7",
      "tree": "0be065bc3649dc727061b86448d3c9b5a826ec98",
      "parents": [
        "384f543e571047953f942c1cf20718ef1c3c155a"
      ],
      "author": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Wed Apr 01 11:44:19 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Wed Apr 01 11:44:19 2026 +0800"
      },
      "message": "[SPARK-56271][PYTHON] Fix type hint and remove unused method for _globals.py\n\n### What changes were proposed in this pull request?\n\n* Add type hint for `_globals.py` and remove the ignore config in mypy\n* Remove the `__reduce__` method that is only needed for python2\n\n### Why are the changes needed?\n\nType hint is trivial and we can just do it. `__reduce__` is a legacy thing just for python2 which has been dead for years.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nmypy passed locally.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55068 from gaogaotiantian/global-type-hint.\n\nAuthored-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\n"
    },
    {
      "commit": "384f543e571047953f942c1cf20718ef1c3c155a",
      "tree": "ca672be05d437f149d4190cb59f20ecb5f8ae737",
      "parents": [
        "89cb6925fb29b513a5834b815f833cd8f810517b"
      ],
      "author": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Wed Apr 01 09:23:20 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Wed Apr 01 09:23:20 2026 +0800"
      },
      "message": "[SPARK-56311][PYTHON] Add type hints for daemon.py\n\n### What changes were proposed in this pull request?\n\n* Add type hints for daemon.py\n* `numbers.Integral` is replaced with `int`\n\n### Why are the changes needed?\n\nPart of the effort to polish type annotations for pyspark.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nCI.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55119 from gaogaotiantian/daemon-type-hint.\n\nAuthored-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\n"
    },
    {
      "commit": "89cb6925fb29b513a5834b815f833cd8f810517b",
      "tree": "fecab4787760ae322e77b66d469d63c6e38c5770",
      "parents": [
        "d580b6531aeecaef8d5abb99e8da1c10484bafa1"
      ],
      "author": {
        "name": "Tian Gao",
        "email": "gaogaotiantian@hotmail.com",
        "time": "Wed Apr 01 08:43:12 2026 +0800"
      },
      "committer": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Wed Apr 01 08:43:12 2026 +0800"
      },
      "message": "[SPARK-56123][PYTHON][FOLLOWUP] Avoid using concat_batches for old version of pyarrow\n\n### What changes were proposed in this pull request?\n\nAvoid using `pa.concat_batches` before 19.0.0 because the method does not exist.\n\n### Why are the changes needed?\n\nIt\u0027s breaking CI https://github.com/apache/spark/actions/runs/23790464731/job/69324715152\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nCI.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #55117 from gaogaotiantian/fix-pa-concat.\n\nAuthored-by: Tian Gao \u003cgaogaotiantian@hotmail.com\u003e\nSigned-off-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\n"
    },
    {
      "commit": "d580b6531aeecaef8d5abb99e8da1c10484bafa1",
      "tree": "1fb33c5fc3f70dee4c6fbbc5939978db36a63856",
      "parents": [
        "a4719590acdb9866e5f33cf7659bccf89fcd4080"
      ],
      "author": {
        "name": "Liang-Chi Hsieh",
        "email": "viirya@gmail.com",
        "time": "Tue Mar 31 13:26:37 2026 -0700"
      },
      "committer": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Tue Mar 31 13:26:37 2026 -0700"
      },
      "message": "[SPARK-49543][SQL] Add SHOW COLLATIONS command\n\n### What changes were proposed in this pull request?\n\nAdd SHOW COLLATIONS SQL syntax to list all Spark built-in collations. Supports optional LIKE pattern filtering (e.g. SHOW COLLATIONS LIKE \u0027UNICODE*\u0027).\n\nOutput schema: NAME, LANGUAGE, COUNTRY, ACCENT_SENSITIVITY, CASE_SENSITIVITY, PAD_ATTRIBUTE, ICU_VERSION — matching the existing collations() TVF but without the constant CATALOG/SCHEMA columns.\n\nImplementation follows the ShowCatalogsCommand pattern as collations are engine-global and not tied to any catalog or namespace.\n\n### Why are the changes needed?\n\nSHOW COLLATIONS is a SQL command supported by MySQL and its derivatives (MariaDB, TiDB) for listing available collations. Spark currently only exposes this information via a table-valued function (SELECT * FROM collations()), which is inconsistent with how other catalog objects are queried (SHOW CATALOGS, SHOW TABLES, etc.) and unfamiliar to users coming from MySQL-compatible databases. This change adds a more intuitive SQL syntax consistent with Spark\u0027s existing SHOW command family.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes, this adds `SHOW COLLATIONS` command.\n\n### How was this patch tested?\n\nUnit tests\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Sonnet 4.6\n\nCloses #55099 from viirya/SPARK-49543-show-collations.\n\nAuthored-by: Liang-Chi Hsieh \u003cviirya@gmail.com\u003e\nSigned-off-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\n"
    },
    {
      "commit": "a4719590acdb9866e5f33cf7659bccf89fcd4080",
      "tree": "7c4e4852739006cbdc80ded653dcdee4f79bf99c",
      "parents": [
        "e6feb272edc4152cb07f43c80a316355fc5b9d3a"
      ],
      "author": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Tue Mar 31 13:14:04 2026 -0700"
      },
      "committer": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Tue Mar 31 13:14:04 2026 -0700"
      },
      "message": "[SPARK-56307][BUILD] Upgrade `log4j` to 2.25.4\n\n### What changes were proposed in this pull request?\n\nThis PR upgrades Apache Log4j to 2.25.4.\n\n### Why are the changes needed?\n\nTo bring in the latest bug fixes from Log4j 2.25.4, which includes fixes for configuration inconsistencies and formatting issues across several layouts.\n- https://github.com/apache/logging-log4j2/releases/tag/rel%2F2.25.4\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nPass the CIs.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (Claude Opus 4.6)\n\nCloses #55114 from dongjoon-hyun/SPARK-56307.\n\nAuthored-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\nSigned-off-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\n"
    },
    {
      "commit": "e6feb272edc4152cb07f43c80a316355fc5b9d3a",
      "tree": "77addc8dcc79cc2492c3dfd01d2789316a90466f",
      "parents": [
        "e7eceb740b63f19433fb8501dc2201139f11a8ea"
      ],
      "author": {
        "name": "Gurpreet Nanda",
        "email": "gurpreet.nanda@databricks.com",
        "time": "Tue Mar 31 09:57:41 2026 -0700"
      },
      "committer": {
        "name": "Anish Shrigondekar",
        "email": "anish.shrigondekar@databricks.com",
        "time": "Tue Mar 31 09:57:41 2026 -0700"
      },
      "message": "[SPARK-51988][SS] Do file checksum verification on read for RocksDB zip file\n\n### What changes were proposed in this pull request?\n\nRocksDB checkpoint zip files were being read without checksum verification, even\nwhen `fileChecksumEnabled \u003d true`. This PR fixes that for checkpoint v2 by routing\nzip reads through `CheckpointFileManager`, which verifies the checksum on\nclose.\n\nNOTE: Checkpoint v1 is excluded due to filename collisions that could cause false\nverification failures since the same filename may be used for concurrent uploads which contain different bytes.\n\n### Why are the changes needed?\n\nCorrupted checkpoint zip files were silently accepted, which could cause incorrect\nstate to be loaded without any error.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes. When `fileChecksumEnabled \u003d true`, loading a corrupted v2 checkpoint zip now throws\n`CHECKPOINT_FILE_CHECKSUM_VERIFICATION_FAILED` instead of silently succeeding.\n\n### How was this patch tested?\n\n  New unit tests in `RocksDBSuite` and `UtilsSuite`.\n\n### Was this patch authored or co-authored using generative AI tooling?\n  Generated-by: Claude Code 2.1.58\n\nCloses #54493 from gnanda/stack/SPARK-51988.\n\nAuthored-by: Gurpreet Nanda \u003cgurpreet.nanda@databricks.com\u003e\nSigned-off-by: Anish Shrigondekar \u003canish.shrigondekar@databricks.com\u003e\n"
    },
    {
      "commit": "e7eceb740b63f19433fb8501dc2201139f11a8ea",
      "tree": "f5e1d3cda92c6a91f672e71eefe90801696bf2b5",
      "parents": [
        "224f30fa8a1b4912f161126df82dafccfae6d5a1"
      ],
      "author": {
        "name": "DenineLu",
        "email": "deninelu@163.com",
        "time": "Tue Mar 31 09:01:38 2026 -0700"
      },
      "committer": {
        "name": "Chao Sun",
        "email": "chao@openai.com",
        "time": "Tue Mar 31 09:01:38 2026 -0700"
      },
      "message": "[SPARK-56235][CORE] Add reverse index in TaskSetManager to avoid O(N) scans in executorLost\n\n### What changes were proposed in this pull request?\nThis PR adds a reverse index `executorIdToTaskIds: HashMap[String, OpenHashSet[Long]]` in `TaskSetManager` to efficiently look up tasks by executor ID, replacing O(N) full scans over `taskInfos` in `executorLost()` with O(K) direct lookups (K \u003d tasks per executor).\n\n**Changes:**\n- Added `executorIdToTaskIds` field in `TaskSetManager`, populated at task launch in `prepareLaunchingTask()`\n- Rewrote the two loops in `executorLost()` to iterate only over tasks on the lost executor via the reverse index\n\n### Why are the changes needed?\n\nIn a production Spark job (Spark 3.5.1, dynamic allocation enabled, disable shuffle tracking) with a single stage containing 5 million tasks, we observed that near the end of the stage, the Spark UI showed the last few tasks stuck in \"RUNNING\" state for **1-2 hours**.\nHowever, checking executor thread dumps confirmed that **no task threads were actually running** — the tasks had already completed on the executor side, but the Driver had not processed their completion messages.\n\u003cimg width\u003d\"2457\" height\u003d\"884\" alt\u003d\"image\" src\u003d\"https://github.com/user-attachments/assets/d001ca27-c297-4777-8b00-25960719570b\" /\u003e\n\u003cimg width\u003d\"4576\" height\u003d\"1854\" alt\u003d\"image\" src\u003d\"https://github.com/user-attachments/assets/503c687c-00a0-41cf-b686-a9ce6f4deaa2\" /\u003e\n\nCPU profiling of the Driver JVM (5-minute snapshot) revealed that `TaskSetManager.executorLost()` was consuming **99.5%** of all CPU samples, due to O(N) full scans over the `taskInfos` HashMap (N \u003d 5,000,000 entries).\n\u003cimg width\u003d\"5086\" height\u003d\"1804\" alt\u003d\"image\" src\u003d\"https://github.com/user-attachments/assets/3a2e76dd-b083-400a-ac99-6296e1abbfff\" /\u003e\n\nThe `executorLost()` method scans the **entire** 
`taskInfos` map to find tasks on the lost executor:\n\n```scala\n// Before: O(N) — scans ALL task attempts to find those on the lost executor\nfor ((tid, info) \u003c- taskInfos if info.executorId \u003d\u003d execId) { ... }\n```\n\nThe blocking is amplified when the following conditions are present:\n\n1. **Long-tail tasks at stage end** — a few remaining tasks take longer than `spark.dynamicAllocation.executorIdleTimeout` (default 60s) to complete. Most executors have finished their work and sit idle, while these slow tasks are still running.\n2. **Batch executor removal** — after the idle timeout is triggered, a large number of RemoveExecutor messages continue to be sent to the DriverEndpoint RPC queue.\n\nAfter this PR, the same workload (5M tasks, 10K executors, dynamic allocation enabled) no longer exhibits the stall. Execution time reduced from **117 minutes to 45 minutes**. At the end of the Stage, the optimization has eliminated the previous `executorLost` hotspot issue.\n\n| | Before | After |\n|---|---|---|\n| **Job Timeline** | \u003cimg width\u003d\"1936\" height\u003d\"353\" alt\u003d\"image\" src\u003d\"https://github.com/user-attachments/assets/3f7073e1-1c11-4c52-940a-bf5d76939c12\" /\u003e | \u003cimg width\u003d\"1940\" height\u003d\"362\" alt\u003d\"image\" src\u003d\"https://github.com/user-attachments/assets/01ff59fd-d07d-417b-849b-c0a6519daac5\" /\u003e |\n| **Driver CPU Top Threads (Stage Tail)** | \u003cimg width\u003d\"2758\" height\u003d\"1450\" alt\u003d\"image\" src\u003d\"https://github.com/user-attachments/assets/25f1925b-8d3c-4920-8f90-f0239cb50657\" /\u003e | \u003cimg width\u003d\"2506\" height\u003d\"1416\" alt\u003d\"image\" src\u003d\"https://github.com/user-attachments/assets/579b7365-0bc0-487b-bcbd-41cd2bd07ead\" /\u003e |\n\nMemory overhead (measured with `jmap -histo:live`):\n\n| Metric | Value |\n|---|---|\n| Total added memory | **~81 MB** |\n| vs `taskInfos` overhead (829 MB) | ~10% |\n| vs Driver heap old gen used (9.3 
GB) | \u003c 1% |\n\n### Does this PR introduce _any_ user-facing change?\nNo.\n\n### How was this patch tested?\nAdded 2 tests in `TaskSetManagerSuite`\n\n### Was this patch authored or co-authored using generative AI tooling?\nNo.\n\nCloses #55030 from DenineLu/optimize-executor-lost.\n\nAuthored-by: DenineLu \u003cdeninelu@163.com\u003e\nSigned-off-by: Chao Sun \u003cchao@openai.com\u003e\n"
    },
    {
      "commit": "224f30fa8a1b4912f161126df82dafccfae6d5a1",
      "tree": "549f9b56a5f0000c13444d46f6e1e871bb6da226",
      "parents": [
        "28e9e6965c2867022f2f39ff564d447f15089fc3"
      ],
      "author": {
        "name": "Peter Toth",
        "email": "peter.toth@gmail.com",
        "time": "Tue Mar 31 17:37:07 2026 +0200"
      },
      "committer": {
        "name": "Peter Toth",
        "email": "peter.toth@gmail.com",
        "time": "Tue Mar 31 17:37:07 2026 +0200"
      },
      "message": "[SPARK-56241][SQL] Derive `outputOrdering` from `KeyedPartitioning` key expressions\n\n### What changes were proposed in this pull request?\n\nWithin a `KeyedPartitioning` partition, all rows share the same key value, so the key expressions are trivially sorted within each partition.\n\nThis PR makes two plan nodes expose that structural guarantee via `outputOrdering`:\n\n- `DataSourceV2ScanExecBase.outputOrdering`\nPreviously this returned the source-reported ordering (via `SupportsReportOrdering`) or fell back to the empty default. It now also handles the case where no ordering is reported but the output partitioning is a `KeyedPartitioning`: since every row in a partition evaluates to the same constant value for the key expressions, the partition is trivially sorted by those expressions.\nThis feature can be enabled with a new `spark.sql.sources.v2.bucketing.partitionKeyOrdering.enabled` config.\n\n- `GroupPartitionsExec.outputOrdering`\nPreviously the coalescing branch always returned `super.outputOrdering` (empty), discarding any ordering the child produced. It now distinguishes the following cases:\n\n  - *No coalescing* (all groups contain exactly one partition): the child\u0027s within-partition ordering is fully preserved -- `child.outputOrdering` is returned as-is, including any key-derived ordering that `DataSourceV2ScanExecBase` already set.\n\n  - *Coalescing without reducers* (multiple input partitions merged into one): all merged partitions share the same original key value, so key expressions are constant across all merged rows. 
The child\u0027s `outputOrdering` should already be in sync with the partitioning (it was either reported by the source or derived from `KeyedPartitioning` in `DataSourceV2ScanExecBase`), so we simply filter it to keep only the sort orders whose expression is a partition key expression.\nThis feature can be enabled with a new `spark.sql.sources.v2.bucketing.preserveKeyOrderingOnCoalesce.enabled` config.\n\n  - *Coalescing with reducers*: merged partitions no longer share the same original key values, so empty ordering is returned.\n\n### Why are the changes needed?\n\nBefore this change, `outputOrdering` on both nodes returned an empty sequence (unless `SupportsReportOrdering` was implemented), even though the within-partition ordering was structurally guaranteed by the partitioning itself. As a result, `EnsureRequirements` would insert a redundant `SortExec` in some cases.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes. Queries that use key-partitioned tables may now avoid inserting some `SortExec`s.\n\n### How was this patch tested?\n\n- New `GroupPartitionsExecSuite` covering all branches of the updated `outputOrdering` logic.\n- New SQL-level tests in `KeyGroupedPartitioningSuite` validating end-to-end plan shapes.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Sonnet 4.6\n\nCloses #55036 from peter-toth/SPARK-56241-outputordering-from-keyedpartitioning.\n\nAuthored-by: Peter Toth \u003cpeter.toth@gmail.com\u003e\nSigned-off-by: Peter Toth \u003cpeter.toth@gmail.com\u003e\n"
    },
    {
      "commit": "28e9e6965c2867022f2f39ff564d447f15089fc3",
      "tree": "e37abfb7699fb863ac58620319ae46aa8c57d913",
      "parents": [
        "a67f1eca72c48dd5ece5c4b58b773239b98aa7af"
      ],
      "author": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Tue Mar 31 22:25:05 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Tue Mar 31 22:25:05 2026 +0800"
      },
      "message": "[SPARK-56074][INFRA] Improve AGENTS.md with inline build/test commands, PR workflow, and dev notes\n\n### What changes were proposed in this pull request?\n\nRewrite AGENTS.md to follow industry best practices for AI coding agent instructions:\n- Git pre-flight checks (uncommitted changes, branch state) before making edits\n- Inline actionable SBT build/test commands instead of links to docs\n- PySpark test setup with venv and Python version check\n- Development notes for SQLQueryTestSuite golden file tests and Spark Connect proto definitions\n- PR workflow guidelines (title format, description template, fork/upstream workflow)\n- Add CLAUDE.md as a symlink to AGENTS.md so Claude Code also picks up the instructions\n\n### Why are the changes needed?\n\nThe existing AGENTS.md only contained links to build docs, which AI coding agents cannot follow. The rewrite provides inline commands and actionable guidance that agents can directly use. All commands have been verified to work.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nAll build and test commands were manually verified:\n- `build/sbt sql/compile`, `build/sbt sql/Test/compile`\n- `build/sbt \"sql/testOnly *MySuite\"`, `build/sbt \"sql/testOnly *MySuite -- -z \\\"test name\\\"\"`\n- PySpark venv setup and `python/run-tests` with both test suite and single test case\n- Confirmed `OBJC_DISABLE_INITIALIZE_FORK_SAFETY` is no longer needed on macOS\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Opus 4.6\n\nCloses #54899 from cloud-fan/ai.\n\nAuthored-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    },
    {
      "commit": "a67f1eca72c48dd5ece5c4b58b773239b98aa7af",
      "tree": "2a74a712f3920f1adb51ef0e18e6009e094d7e78",
      "parents": [
        "640476d6283795ec09e97ed13a54d671aedf39ef"
      ],
      "author": {
        "name": "Kent Yao",
        "email": "kentyao@microsoft.com",
        "time": "Tue Mar 31 06:05:05 2026 -0700"
      },
      "committer": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Tue Mar 31 06:05:05 2026 -0700"
      },
      "message": "[SPARK-56137][UI][TESTS] Add regression tests for SQL tab DataTables migration\n\n### What changes were proposed in this pull request?\n\nAdd regression tests for the SQL tab DataTables migration (SPARK-55875 umbrella).\n\n**`AllExecutionsPageSuite`** — 4 new tests:\n- DataTables CSS/JS resource inclusion verification\n- `group-sub-exec` config propagation (default and explicit)\n- Loading spinner rendering\n- REST API integration (run query → verify `/api/v1/.../sql` endpoint)\n\n**`SqlResourceWithActualMetricsSuite`** — 4 new tests:\n- ISO date format validation for `submissionTime`\n- No artificial result limit (5 queries all returned)\n- `planDescription` in detail endpoint\n- Multiple query types (DDL, DML, SELECT) all appear with COMPLETED status\n\n### Why are the changes needed?\n\nThe SQL tab was migrated from server-side rendered HTML tables to client-side DataTables. These regression tests ensure correctness of:\n1. Page skeleton rendering (resources, config, spinner)\n2. REST API responses (date format, no limit, status, plan description)\n3. Coverage across query types (DDL, DML, SELECT)\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. Test-only changes.\n\n### How was this patch tested?\n\nAll 12 tests pass:\n- `AllExecutionsPageWithInMemoryStoreSuite` (5 tests)\n- `AllExecutionsPageWithRocksDBBackendSuite` (5 tests)\n- `SqlResourceWithActualMetricsSuite` (8 tests)\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: GitHub Copilot (Claude Opus 4.6)\n\nCloses #55111 from yaooqinn/SPARK-56137.\n\nAuthored-by: Kent Yao \u003ckentyao@microsoft.com\u003e\nSigned-off-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\n"
    },
    {
      "commit": "640476d6283795ec09e97ed13a54d671aedf39ef",
      "tree": "424d4a67fd01602f5e6fcd13fc0ec430b36ef692",
      "parents": [
        "a1de465b379f9bb59c3bc99ed798f353a1d1bf3a"
      ],
      "author": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Tue Mar 31 05:54:53 2026 -0700"
      },
      "committer": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Tue Mar 31 05:54:53 2026 -0700"
      },
      "message": "[SPARK-56303][K8S] Add Java-friendly factory methods to `JavaMainAppResource`\n\n### What changes were proposed in this pull request?\n\nThis PR adds a companion object to `JavaMainAppResource` with Java-friendly factory methods:\n- `JavaMainAppResource.of(primaryResource)` — creates with a specific resource path\n- `JavaMainAppResource.create()` — creates with no resource (uses spark-internal)\n\n### Why are the changes needed?\n\nThe new factory methods allow Java callers to simply write:\n\n```java\nJavaMainAppResource res \u003d JavaMainAppResource.of(\"path/to/jar\");\nJavaMainAppResource empty \u003d JavaMainAppResource.create();\n```\n\nwhile the existing Scala API remains unchanged.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes. New public factory methods are added to `JavaMainAppResource` for Java interoperability. No existing behavior is changed.\n\n### How was this patch tested?\n\nPass the CIs with the newly added test case.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code\n\nCloses #55112 from dongjoon-hyun/SPARK-56303.\n\nAuthored-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\nSigned-off-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\n"
    },
    {
      "commit": "a1de465b379f9bb59c3bc99ed798f353a1d1bf3a",
      "tree": "7da8d04b509f26af58b44a392101b0308276dab1",
      "parents": [
        "292e1d56c5e81d1de346efe3ef12526f0158d121"
      ],
      "author": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Tue Mar 31 00:40:01 2026 -0700"
      },
      "committer": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Tue Mar 31 00:40:01 2026 -0700"
      },
      "message": "[SPARK-56300][K8S] Add Java-friendly factory method to `KubernetesDriverSpec`\n\n### What changes were proposed in this pull request?\n\nThis PR adds a Java-friendly `create` factory method to `KubernetesDriverSpec` companion object that accepts `java.util.List` and `java.util.Map` instead of Scala `Seq` and `Map`.\n\n### Why are the changes needed?\n\n`KubernetesDriverSpec` is a `DeveloperApi` public class used by external projects such as the `Apache Spark K8s Operator`. Since it exposes Scala collection types (`Seq[HasMetadata]`, `Map[String, String]`), Java callers must manually convert Java collections to Scala equivalents. The new `create` method provides a convenient entry point for Java users.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo behavior change because this is a new additional API.\n\n### How was this patch tested?\n\nPass the CIs with newly added test suite.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nGenerated-by: Claude Code (claude-opus-4-6)\n\nCloses #55106 from dongjoon-hyun/SPARK-56300.\n\nAuthored-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\nSigned-off-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\n"
    },
    {
      "commit": "292e1d56c5e81d1de346efe3ef12526f0158d121",
      "tree": "3f3ef6426003a63ba07f988319bc74e57e68d928",
      "parents": [
        "0419e68b7e7cdda443a44efa3abe251595af3e0a"
      ],
      "author": {
        "name": "Ruifeng Zheng",
        "email": "ruifengz@apache.org",
        "time": "Tue Mar 31 00:08:56 2026 -0700"
      },
      "committer": {
        "name": "Dongjoon Hyun",
        "email": "dongjoon@apache.org",
        "time": "Tue Mar 31 00:08:56 2026 -0700"
      },
      "message": "[SPARK-56301][PYTHON] Fix typos in `error-conditions.json`\n\n### What changes were proposed in this pull request?\nfix typos\n\n### Why are the changes needed?\nto fix typos\n\n### Does this PR introduce _any_ user-facing change?\nerror-message changes\n\n### How was this patch tested?\nCI\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nCo-authored-by: Claude Code (Opus 4.6)\n\nCloses #55105 from zhengruifeng/py_error_cleanup.\n\nAuthored-by: Ruifeng Zheng \u003cruifengz@apache.org\u003e\nSigned-off-by: Dongjoon Hyun \u003cdongjoon@apache.org\u003e\n"
    },
    {
      "commit": "0419e68b7e7cdda443a44efa3abe251595af3e0a",
      "tree": "33ab5bb71910c71e6eda87446fc8fd1e40aa6a41",
      "parents": [
        "6e8c690f82c00c17fc103c1af9eb13a330c5bf43"
      ],
      "author": {
        "name": "Helios He",
        "email": "helios.he@databricks.com",
        "time": "Tue Mar 31 13:27:04 2026 +0800"
      },
      "committer": {
        "name": "Wenchen Fan",
        "email": "wenchen@databricks.com",
        "time": "Tue Mar 31 13:27:04 2026 +0800"
      },
      "message": "[SPARK-56155][SQL] Collect_list/collect_set sql() function includes \"RESPECT NULLS\"\n\n### What changes were proposed in this pull request?\n\nFix the pretty string for collect_list/set column alias.  When collect_list/collect_set appear in the column header, the string includes \"RESPECT NULLS\".\n\n### Why are the changes needed?\n\nFor clarity.  Output may be misleading if we don\u0027t show \u0027RESPECT NULLS\u0027 in the column header even when user has included it.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes.  Column headers now include \u0027RESPECT NULLS\u0027 when the keyword has been included.\n\n### How was this patch tested?\n\nUT in `DataFrameAggregateSuite`\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo\n\nCloses #54957 from helioshe4/collect-list-set-sql-include-respect-nulls.\n\nAuthored-by: Helios He \u003chelios.he@databricks.com\u003e\nSigned-off-by: Wenchen Fan \u003cwenchen@databricks.com\u003e\n"
    }
  ],
  "next": "6e8c690f82c00c17fc103c1af9eb13a330c5bf43"
}
