{
  "log": [
    {
      "commit": "f2f6203aa496cd5e502690cc7fea913acc93dd61",
      "tree": "0a10cedeca37145e82b7948491ceae43b59abd10",
      "parents": [
        "1fffd70db1cb7b7e0441a09842b0d152dfaafa2e"
      ],
      "author": {
        "name": "Xinli Shang",
        "email": "shangx@uber.com",
        "time": "Sun May 10 09:26:06 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Sun May 10 21:56:06 2026 +0530"
      },
      "message": "fix(flink): add Apache license header to muttley/README.md (#18713)\n\nPR #18394 added hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/muttley/README.md\nwithout an Apache license header, causing apache-rat:check to fail on every\nnew build of master (\"Too many files with unapproved license: 1\").\n\nPrepend the standard Apache 2.0 HTML-comment license header so RAT passes.\n\nVerified locally:\n  cd hudi-flink-datasource/hudi-flink\n  mvn -Pflink1.20 -Dscala-2.12 apache-rat:check\n  -\u003e Rat check: Summary over all files. Unapproved: 0\n\nCo-authored-by: Xinli Shang \u003cshangxinli@apache.org\u003e"
    },
    {
      "commit": "1fffd70db1cb7b7e0441a09842b0d152dfaafa2e",
      "tree": "3ae264e1b6928fd424814d418842291775e031fe",
      "parents": [
        "f8e4b9f3db52fcaab05a37b15f7ea7ce8c8b3355"
      ],
      "author": {
        "name": "chaoyang",
        "email": "chaoyang@apache.org",
        "time": "Sun May 10 11:03:19 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Sun May 10 11:03:19 2026 +0800"
      },
      "message": "perf: Reduce unnecessary FSDataOutputStream#hsync to enhance append performance (#17517)\n\n* perf: Reduce unnecessary `FSDataOutputStream#hsync` to enhance append performance\n\n1. Reduce unnecessary `FSDataOutputStream#hsync` to enhance append performance\n\nSigned-off-by: TheR1sing3un \u003cchaoyang@apache.org\u003e\n\n* feat: flush behavior compatible with the block append mode\n\n1. flush behavior compatible with the block append mode\n\nSigned-off-by: TheR1sing3un \u003cchaoyang@apache.org\u003e\n\n* fixup: address review - drop syncDuringFlush, expose explicit sync()\n\nFollowing @danny0405\u0027s suggestion in the PR review, ensure only\ncommit-level visibility on the production path:\n\n- Remove the `withSyncDuringFlush` builder option and the\n  `flush(boolean)` overload on HoodieLogFormatWriter; the production\n  path no longer flushes or hsyncs at appendBlocks.\n- Expose `Writer#sync()` (flush + hsync) as an explicit API for tests\n  that assert per-append visibility on the underlying file system.\n- closeStream still calls sync() once before close so a closed writer\n  guarantees data is persisted to DataNodes.\n- Update tests that previously relied on `withSyncDuringFlush(true)`\n  to call `writer.sync()` explicitly before per-append FileStatus\n  size assertions, and rename the related assertion message to drop\n  the misleading \"auto-flushed\" wording.\n\n---------\n\nSigned-off-by: TheR1sing3un \u003cchaoyang@apache.org\u003e"
    },
    {
      "commit": "f8e4b9f3db52fcaab05a37b15f7ea7ce8c8b3355",
      "tree": "052b6ce6016c1e5ea0fe0f17283a45f76a2a2479",
      "parents": [
        "bab463aa1ca8c5f5873289a381e762982889f132"
      ],
      "author": {
        "name": "Xinli Shang",
        "email": "shangx@uber.com",
        "time": "Sat May 09 19:10:17 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Sat May 09 19:10:17 2026 -0700"
      },
      "message": "docs: Document muttley package as internal/optional for OSS users (#18394)"
    },
    {
      "commit": "bab463aa1ca8c5f5873289a381e762982889f132",
      "tree": "598bf335b185de950616aa3049febda2a66c3da5",
      "parents": [
        "93e334ce042d72418c77abaa3246231b96a5615c"
      ],
      "author": {
        "name": "Shuo Cheng",
        "email": "njucshuo@gmail.com",
        "time": "Sat May 09 20:39:27 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Sat May 09 20:39:27 2026 +0800"
      },
      "message": "feat(flink): Support data skipping based on column stats for source V2 (#18706)"
    },
    {
      "commit": "93e334ce042d72418c77abaa3246231b96a5615c",
      "tree": "4bfbbd6fd5c423bed100e0f9c677d4e3835cf5ef",
      "parents": [
        "004d159968964d17875adb0ba97a51a206dfe1ff"
      ],
      "author": {
        "name": "Shuo Cheng",
        "email": "njucshuo@gmail.com",
        "time": "Sat May 09 20:29:41 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Sat May 09 20:29:41 2026 +0800"
      },
      "message": "feat(flink): Support dynamic bucket for flink streaming with partitio… (#18640)"
    },
    {
      "commit": "004d159968964d17875adb0ba97a51a206dfe1ff",
      "tree": "3a14c7be3a32c73b609cf3b4d869aa3850d426fa",
      "parents": [
        "47bf4e41342e4f1dab26a7fb1489f278fbd1226c"
      ],
      "author": {
        "name": "Shihuan Liu",
        "email": "skywalker0618@gmail.com",
        "time": "Sat May 09 05:16:58 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Sat May 09 20:16:58 2026 +0800"
      },
      "message": "refactor(flink): Remove legacy Parquet nested readers superseded by Flink 2.1 Dremel path (FLINK-35702) (#18701)\n\n* refactor(flink): Remove legacy Parquet nested readers superseded by Flink 2.1 Dremel path (FLINK-35702)\n* Fix flaky IT test"
    },
    {
      "commit": "47bf4e41342e4f1dab26a7fb1489f278fbd1226c",
      "tree": "93ce59894780880fadaf41eb88bf3acab90c7a50",
      "parents": [
        "34e9c7c5bbdc30143a3f2dbb6931149f6350357f"
      ],
      "author": {
        "name": "Shihuan Liu",
        "email": "skywalker0618@gmail.com",
        "time": "Thu May 07 20:07:05 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 08 11:07:05 2026 +0800"
      },
      "message": "feat(flink): Wire Flink 2.1 nested Parquet readers into the Hudi read path (FLINK-35702) (#18700)"
    },
    {
      "commit": "34e9c7c5bbdc30143a3f2dbb6931149f6350357f",
      "tree": "cfb5d08a404cacbf0f7998d098064efa8f5f9e04",
      "parents": [
        "63f721ddd07789b65b49167aa86e23e41f390d70"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Thu May 07 19:05:05 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu May 07 19:05:05 2026 +0800"
      },
      "message": "test(schema): Add MOR log-only compaction tests for custom types (#18583)\n\n* test(schema): Add MOR log-only compaction tests for custom types\n\nCover the invariant that the HoodieSchema.TYPE_METADATA_FIELD descriptor\nand payload shape of a custom-typed column survive inline compaction of\na log-only MOR table into a base file.\n\n- TestVectorDataSource: add testMorLogOnlyCompactionPreservesVectorMetadata\n  (5 commits via SQL + MERGE INTO to trigger default inline compaction).\n- TestVariantDataType: equivalent VARIANT test, gated on Spark 4.0+,\n  asserting native VariantType round-trips through compaction.\n- TestBlobDataType (new): BLOB INLINE and BLOB OUT_OF_LINE cases. Inline\n  uses named_struct with hex byte literals; out-of-line creates real files\n  via BlobTestHelpers.createTestFile and verifies bytes via read_blob().\n\n* test(schema): Address review comments on MOR log-only compaction tests\n\n- Pin hoodie.compact.inline.max.delta.commits \u003d \u00275\u0027 on all 4 tables so\n  compaction triggers deterministically rather than via the implicit\n  default\n- Rename path to externalPath in outOfLineBlobLiteral\n- Fail with the missing id in embeddingOf instead of a bare .get\n- Extract val tablePath in the variant test for consistency"
    },
    {
      "commit": "63f721ddd07789b65b49167aa86e23e41f390d70",
      "tree": "64b7b7dcf794e08cacf469c8f8a587d376c161e8",
      "parents": [
        "87019a332db78e88ccb7968739a91a45f8f1a06f"
      ],
      "author": {
        "name": "Matthew",
        "email": "hushihao2020x@163.com",
        "time": "Thu May 07 18:25:28 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu May 07 18:25:28 2026 +0800"
      },
      "message": "fix: Fix reflection ctor signature for AwsGlueCatalogSyncTool in HiveSyncContext (#18697)"
    },
    {
      "commit": "87019a332db78e88ccb7968739a91a45f8f1a06f",
      "tree": "9ee3f484cb916b1bff2f1805339939960313a3d4",
      "parents": [
        "40295605aae691da07e93a1a2919d38b41cbb6dd"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Thu May 07 17:29:41 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu May 07 17:29:41 2026 +0800"
      },
      "message": "fix(hive): Tolerate pruned ArrayWritable in nested BLOB projection (#18581)\n\nFixes issue: #18577\n\nWhen Hive\u0027s FetchOperator pushes nested column projection (e.g.\n`SELECT blob_data.reference.external_path`) through Parquet via\n`hive.io.file.readNestedColumn.paths`, the reader returns a compacted\nArrayWritable holding only the projected sub-fields in low slots,\nwhile oldSchema stays the full 3-field canonical BLOB\n(BlobLogicalType.validate rejects partial field lists; pruneDataSchema\ndeliberately preserves the canonical shape). Positional indexing into\nthe compacted array AIOBEs, and even with a bounds guard, Hive\u0027s\nObjectInspector downstream expects projected values at their\ncanonical positions - the rewrite must remap, not just survive.\n\nIntroduce a projection-aware rewrite path:\n\n- HoodieProjectionMask (new) - immutable per-level descriptor of\n  physical layout. isCanonicalAtThisLevel() means schema positions\n  apply; otherwise physicalIndexOf / physicalOrder map field names\n  to physical slots.\n- HoodieColumnProjectionUtils.buildNestedProjectionMask() - parses\n  hive.io.file.readNestedColumn.paths, walks RECORD / BLOB / VARIANT,\n  returns the matching mask (or all() when projection is absent).\n- HiveHoodieReaderContext threads the mask into a new 5-arg\n  rewriteRecordWithNewSchema overload.\n- HoodieArrayWritableSchemaUtils.rewriteRecordWithNewSchemaInternal\n  branches on the mask:\n    - rewriteCanonicalRecord - legacy positional logic with a\n      defensive oldField.pos() \u003c arrayLength guard.\n    - rewriteCompactedRecord - iterates physicalOrder() and writes\n      each projected slot at its canonical position so the\n      downstream ObjectInspector finds fields where it expects them.\n\nThe compacted path is the primary fix; the canonical-path bounds\nguard is a defensive fallback.\n\nTests: TestHoodieColumnProjectionUtils covers mask construction;\nTestHoodieArrayWritableSchemaUtils covers the AIOBE reproducer,\ncompacted round-trip, and a canonical-shape regression.\nHoodieSchemaTestUtils gains createPlainBlobRecord and\ncreatePlainVariantRecord helpers (variant helper for upcoming\nVARIANT parity)."
    },
    {
      "commit": "40295605aae691da07e93a1a2919d38b41cbb6dd",
      "tree": "52c225d33112da4aaa320cd9fdd1eb3fa0bd1009",
      "parents": [
        "c36a5f7f517e3349385d95bf1ff1e55ba9caa275"
      ],
      "author": {
        "name": "Shihuan Liu",
        "email": "skywalker0618@gmail.com",
        "time": "Wed May 06 23:23:48 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu May 07 14:23:48 2026 +0800"
      },
      "message": "feat(flink): Backport Flink 2.1 nested Parquet column readers and INT64 timestamp dispatch (FLINK-35702) (#18636)\n\n* feat(flink): Backport Flink 2.1 nested Parquet column readers and INT64 timestamp dispatch (FLINK-35702)\n* Minor fixes"
    },
    {
      "commit": "c36a5f7f517e3349385d95bf1ff1e55ba9caa275",
      "tree": "0e7daff73142bbc6db8ce83c4cda504857caf3a1",
      "parents": [
        "91f341f8fa795879ccb32784ea7f12af0feab82d"
      ],
      "author": {
        "name": "Shuo Cheng",
        "email": "njucshuo@gmail.com",
        "time": "Thu May 07 08:49:13 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu May 07 08:49:13 2026 +0800"
      },
      "message": "fix(flink): Avoid emitting deletes for Flink source v2 batch reads (#18694)"
    },
    {
      "commit": "91f341f8fa795879ccb32784ea7f12af0feab82d",
      "tree": "c6a951ca53fe99aeffa206097d5a49730dc93c5e",
      "parents": [
        "471bb48338bd9642bd593472b029915180156038"
      ],
      "author": {
        "name": "Prashant Wason",
        "email": "pwason@uber.com",
        "time": "Mon May 04 19:06:26 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Tue May 05 10:06:26 2026 +0800"
      },
      "message": "fix: filter EXTERNAL property in SparkCatalogMetaStoreClient.toCatalogTable (#18672)\n\nHudi\u0027s `HMSDDLExecutor.createTable` sets both `tableType\u003dEXTERNAL_TABLE`\nand `parameters[EXTERNAL]\u003dTRUE` on the Hive Table object when the table\nis external. When that Table flows through `SparkCatalogMetaStoreClient`\ninto `HiveExternalCatalog`, `verifyTableProperties` rejects:\n\n  AnalysisException: Cannot set or change the preserved property key:\n  \u0027EXTERNAL\u0027\n\nSpark uses `CatalogTableType.EXTERNAL` on the `CatalogTable` itself to\nencode external-ness, and treats `EXTERNAL\u003d...` as a duplicate (and\nforbidden) encoding. We already map `tableType` correctly via\n`if (\"EXTERNAL_TABLE\".equalsIgnoreCase(table.getTableType))`, so dropping\nthe property in the same filter that already strips `spark.sql.*` is safe.\n\nSame family as #18654 (filter `spark.sql.*`).\n\nAdds a regression test mirroring the real `HMSDDLExecutor` shape:\n`tableType\u003dEXTERNAL_TABLE` AND `parameters[EXTERNAL]\u003dTRUE`.\n\nCo-authored-by: Claude Opus 4.7 \u003cnoreply@anthropic.com\u003e"
    },
    {
      "commit": "471bb48338bd9642bd593472b029915180156038",
      "tree": "66fbc137bf56e460dbae65ad9c31585e52003f51",
      "parents": [
        "127c6ee031867a7a6b49c6f3ccadbba8161ee8ef"
      ],
      "author": {
        "name": "Surya Prasanna",
        "email": "syalla@uber.com",
        "time": "Mon May 04 13:42:43 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon May 04 13:42:43 2026 -0700"
      },
      "message": "refactor: move checkpoint metadata lookup helper to hudi-common (#18489)\n\nThis PR moves the checkpoint metadata lookup helper into hudi-common so ingestion-related code can reuse the same timeline utility instead of keeping the logic in utilities-only code."
    },
    {
      "commit": "127c6ee031867a7a6b49c6f3ccadbba8161ee8ef",
      "tree": "11bfd22cb90d1c761d42695237fcd023ad5439bc",
      "parents": [
        "4d0e9cd47f9e70564dee2ea0718f8272d8eb4df4"
      ],
      "author": {
        "name": "Krishen",
        "email": "22875197+kbuci@users.noreply.github.com",
        "time": "Mon May 04 13:33:22 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon May 04 13:33:22 2026 -0700"
      },
      "message": "feat(common): roll over commit metadata to clean (#18590)\n\nWhen rolling metadata is configured (hoodie.write.rolling.metadata.keys), important metadata like schema and checkpoint keys are carried forward across commits. However, clean instants do not participate in this rolling mechanism, they neither receive rolled-over metadata nor serve as a source for subsequent lookups. After archival removes old ingestion commits, if only clean instants remain on the active timeline between surviving commits, the chain of rolled-over metadata can break.\n\nThis PR ensures that clean commits also carry rolled-over metadata in their extraMetadata field, preserving the rolling metadata chain across archival.\n\n\n\n\n---------\n\nCo-authored-by: Krishen Bhan \u003c“bkrishen@uber.com”\u003e"
    },
    {
      "commit": "4d0e9cd47f9e70564dee2ea0718f8272d8eb4df4",
      "tree": "889ac260b6c3cb7a032880d854e0d92b548f68fc",
      "parents": [
        "cde0e3907e174d66476b1f001f41ef19823b7be5"
      ],
      "author": {
        "name": "Rahil C",
        "email": "32500120+rahil-c@users.noreply.github.com",
        "time": "Sun May 03 09:47:22 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Sun May 03 09:47:22 2026 -0700"
      },
      "message": "fix(lance): prevent file splitting for Lance base files to avoid duplicate reads (#18678)"
    },
    {
      "commit": "cde0e3907e174d66476b1f001f41ef19823b7be5",
      "tree": "6b5e068f07a42122cfae2e336aa49ecf5c21955c",
      "parents": [
        "695294cd138463730a2dedb78b81ed827421af1b"
      ],
      "author": {
        "name": "Hudi Agent",
        "email": "yihua.guo.bot@gmail.com",
        "time": "Sat May 02 15:30:42 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Sun May 03 06:30:42 2026 +0800"
      },
      "message": "fix(ci): bump Maven heap to 8g to fix OOM in CI builds (#18618)\n\nIncrease MAVEN_OPTS heap from 4g to 8g and compiler maxmem from 4096m\nto 8192m in GitHub Actions bot.yml. Also add -Xmx8g to the Azure\nPipelines bundle validation build step.\n\nCo-authored-by: hudi-agent \u003c277184175+hudi-agent@users.noreply.github.com\u003e"
    },
    {
      "commit": "695294cd138463730a2dedb78b81ed827421af1b",
      "tree": "d9428a97a3f144d352d92a01f875f1e5af5b8a61",
      "parents": [
        "f7508ded95a534f5e6e5ffc5150a83a5662d05db"
      ],
      "author": {
        "name": "Y Ethan Guo",
        "email": "ethan.guoyihua@gmail.com",
        "time": "Sat May 02 05:52:48 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Sat May 02 20:52:48 2026 +0800"
      },
      "message": "fix: Honor SparkSession overrides for rebase mode and timezone in compaction tasks (#18675)\n\n* Honor SparkSession overrides for rebase mode and timezone in compaction tasks\n\nWhen MOR compaction runs outside a Spark SQL execution context (e.g. a\nstandalone CompactTask runner), `SQLConf.get` on the executor task thread\nreturns a fresh fallback `SQLConf` with default values, not the user\u0027s\nSparkSession overrides. As a result, `Spark{3_3,3_4,3_5,4_0}Adapter\n.getDateTimeRebaseMode()` resolved to `EXCEPTION` even when the user had\nset `spark.sql.parquet.datetimeRebaseModeInWrite\u003dLEGACY`, producing\n`SparkUpgradeException [INCONSISTENT_BEHAVIOR_CROSS_VERSION\n.WRITE_ANCIENT_DATETIME]` during compaction of MOR tables containing\npre-1900 timestamps. The same gap affected\n`HoodieRowParquetWriteSupport.init()`\u0027s `sessionLocalTimeZone` read.\n\nAdapter and WriteSupport now resolve the value in this order:\n  1. SQLConf override (so `spark.conf.set(...)` on the SparkSession takes\n     effect on the driver and inside SQL execution contexts).\n  2. SparkConf via SparkEnv.get.conf (broadcast to every executor at\n     startup, so user-set keys are honored on executor tasks running\n     outside a SQL execution context).\n  3. The ConfigEntry\u0027s own default (or SQLConf.sessionLocalTimeZone for\n     the timezone helper).\n\nAdds TestSparkAdapterRebaseModePropagation (3 methods) covering rebase\nmode and timezone propagation into vanilla parallelize().map() task\nclosures. Each test fails without the fix.\n\n* Apply fix to Spark4_1Adapter; use flatMap+Option to avoid null inside Option\n\n* Use SQLConf.getConf(entry, null) instead of getConfString(key, null)\n\n* Make Spark4_1 consistent with 3.x/4.0; add default-behavior test; trim scaladocs\n\n* Add unit test for resolveSessionLocalTimeZone in hudi-spark-client\n\n* Add SQLConf-override test method to lift coverage on new code\n\n* Drop redundant public modifiers from JUnit 5 test class and methods\n\n* Read expected default from SQLConf so test works on Spark 3.x and 4.1\n\n* Document why init() coverage lives in hudi-spark integration tests\n\n* Inline single-use Parquet metadata keys; keep only timeZone constant\n\n* Add SparkConf-branch test for resolveSessionLocalTimeZone"
    },
    {
      "commit": "f7508ded95a534f5e6e5ffc5150a83a5662d05db",
      "tree": "07b3cea6e8219f685496eeb380fa89d7881f1f22",
      "parents": [
        "f8d70cb58d7c9ea6a96ece59ef4ce097c46c3921"
      ],
      "author": {
        "name": "ashokkumar-allu",
        "email": "65997235+ashokkumar-allu@users.noreply.github.com",
        "time": "Fri May 01 12:32:02 2026 -0500"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 01 10:32:02 2026 -0700"
      },
      "message": "test(spark): Add date logical type test to TestAvroConversionUtils (#18584)\n\n* [MINOR] Add date logical type test to TestAvroConversionUtils\n\nAdd a test case that verifies createConverterToRow correctly handles\nAvro\u0027s date logical type (int with logicalType\u003ddate), ensuring the\nconversion from epoch-days integer to java.sql.Date preserves the\ncorrect date value.\n\n* [MINOR] Address review comments: rename convertor to converter, add comments\n\n---------\n\nCo-authored-by: gallu \u003cgallu@uber.com\u003e"
    },
    {
      "commit": "f8d70cb58d7c9ea6a96ece59ef4ce097c46c3921",
      "tree": "7cc7377706e87160dd894f0983f4f58f4de9f06d",
      "parents": [
        "39797b42cfe95e9b36d718a461027b6c056bfd4d"
      ],
      "author": {
        "name": "chaoyang",
        "email": "chaoyang@apache.org",
        "time": "Fri May 01 11:37:54 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 01 11:37:54 2026 +0800"
      },
      "message": "feat: Introduce a Spark procedure to trigger LSM timeline compaction (#18659)\n\n* feat: Introduce a Spark procedure to trigger LSM timeline compaction\n\n* fixup: address review on RunTimelineCompactionProcedure\n\n- Acquire the txn state-change lock around compactAndClean so the procedure cannot race with a concurrent archival/compaction over the LSM timeline manifest\n- Use TimeUnit.NANOSECONDS.toMillis for self-documenting unit conversion"
    },
    {
      "commit": "39797b42cfe95e9b36d718a461027b6c056bfd4d",
      "tree": "85c19305c779e5b884102f6f9d1169db34c7b7e7",
      "parents": [
        "38db5ed05e5c9823ac676cce079c4735b46d68f4"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Fri May 01 11:36:34 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 01 11:36:34 2026 +0800"
      },
      "message": "docs: claim RFC-104 for schema evolution unification (#18660)"
    },
    {
      "commit": "38db5ed05e5c9823ac676cce079c4735b46d68f4",
      "tree": "4ee7d7883544652f50ba54f60bc37fd5146765b3",
      "parents": [
        "0a28d695eb1558b01bb38b11996a436a2e3e23dd"
      ],
      "author": {
        "name": "vinoth chandar",
        "email": "vinothchandar@users.noreply.github.com",
        "time": "Thu Apr 30 09:32:27 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 30 12:32:27 2026 -0400"
      },
      "message": "chore: Add context7.json with URL and public key (#18662)"
    },
    {
      "commit": "0a28d695eb1558b01bb38b11996a436a2e3e23dd",
      "tree": "72505061824ca505fc097b4ee624cb0b08b92ee0",
      "parents": [
        "744385680b119e13ed276fd8ee96990600f1cd9e"
      ],
      "author": {
        "name": "kartikeyaagrawal",
        "email": "52993905+kartikeyaagrawal@users.noreply.github.com",
        "time": "Thu Apr 30 15:10:08 2026 +0530"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 30 17:40:08 2026 +0800"
      },
      "message": "chore(docker): reduce base_java17 and spark_base image size (#18542)\n\nAddresses #18523.\n\nShrinks the Java 17 integration-test image from ~3.56 GB to ~2.58 GB\n(~27%) without changing any runtime behavior. The container commands,\nenvironment variables, exposed ports, and entrypoint are identical.\n\nbase_java17/Dockerfile:\n- switch base image from eclipse-temurin:17-jdk to 17-jre-jammy. The\n  container only runs Hadoop; no Java compilation happens inside it,\n  so the JDK toolchain is not needed.\n- convert to a multi-stage build. Stage 1 downloads and extracts the\n  Hadoop tarball; stage 2 only COPYs the extracted tree. curl,\n  ca-certificates, and the tar.gz no longer land in the final layer.\n- use --no-install-recommends and clean apt lists in the runtime stage.\n- drop the unused .asc signature download and the now-dead wget dep.\n\nspark_base/Dockerfile:\n- replace the Python-3.10.14-from-source build (which pulled in\n  build-essential and a full compile toolchain, then built CPython\n  with --enable-optimizations inside the image) with the distro\n  python3-minimal + python3-pip packages. PySpark only needs a\n  Python runtime at runtime."
    },
    {
      "commit": "744385680b119e13ed276fd8ee96990600f1cd9e",
      "tree": "b5c4b339bf164349f3d714cd1f651e02751b22f3",
      "parents": [
        "7b662e891e3e157e2e62f56b35a485debbbb8072"
      ],
      "author": {
        "name": "Prashant Wason",
        "email": "pwason@uber.com",
        "time": "Wed Apr 29 19:41:55 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 30 10:41:55 2026 +0800"
      },
      "message": "fix: make SparkCatalogMetaStoreClient.setMetaConf a no-op (#18652)\n\nHoodieHiveSyncClient invokes IMetaStoreClient.setMetaConf at construction\ntime to forward hive.metastore.callerContext.* properties to the\nmetastore for audit/tracing. With Spark\u0027s external catalog there is no\nremote HMS to receive those values, so throwing UnsupportedOperationException\nunconditionally breaks every sync client that uses\nSparkCatalogMetaStoreClient (it fails before any actual catalog operation\ncan run).\n\nAccept the call silently. The caller-context properties are diagnostic\nmetadata; dropping them is the correct semantic for a non-thrift catalog\nbackend.\n\nCo-authored-by: Claude Opus 4.7 \u003cnoreply@anthropic.com\u003e"
    },
    {
      "commit": "7b662e891e3e157e2e62f56b35a485debbbb8072",
      "tree": "5e665c671c4db78d0ec0943d2f700f3db704b37f",
      "parents": [
        "8d348cc6b927bb63944e11db2b208bccff8fbb55"
      ],
      "author": {
        "name": "Prashant Wason",
        "email": "pwason@uber.com",
        "time": "Wed Apr 29 19:40:56 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 30 10:40:56 2026 +0800"
      },
      "message": "fix: filter spark.sql.* properties in SparkCatalogMetaStoreClient.toCatalogTable (#18654)\n\nSpark\u0027s HiveExternalCatalog.alterTable / createTable rejects table\nproperties whose keys start with \"spark.sql.\" with:\n\n  AnalysisException: Cannot persist \u003ctable\u003e into Hive metastore as table\n  property keys may not start with \u0027spark.sql.\u0027: [spark.sql.create.version,\n  spark.sql.sources.provider, spark.sql.sources.schema.partCol.0,\n  spark.sql.sources.schema.numParts, spark.sql.sources.schema.numPartCols,\n  spark.sql.sources.schema.part.0]\n\nThese keys are reserved for Spark\u0027s internal use (provider, schema parts,\ncreate version) and Spark itself writes them when persisting a CatalogTable.\nOn the way back through getTable they appear in the parameters map, and\ntoCatalogTable currently passes them straight through. The next alter_table\ncall then trips the validation and the entire HoodieHiveSyncClient flow\nfails - no actual sync happens.\n\nStrip \"spark.sql.*\" keys in toCatalogTable before constructing the\nCatalogTable. Spark re-derives and writes them from the CatalogTable, so\ndropping them on the way in is safe.\n\nCo-authored-by: Claude Opus 4.7 \u003cnoreply@anthropic.com\u003e"
    },
    {
      "commit": "8d348cc6b927bb63944e11db2b208bccff8fbb55",
      "tree": "baf7d31debe586d4cb5401f38c1edfb643d60c81",
      "parents": [
        "426cbb8c232e11eb744edc713754ee5cd946c16e"
      ],
      "author": {
        "name": "Rahil C",
        "email": "32500120+rahil-c@users.noreply.github.com",
        "time": "Wed Apr 29 09:22:27 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed Apr 29 09:22:27 2026 -0700"
      },
      "message": "feat(examples): Add Hudi Unstructed Demo env (#18643)"
    },
    {
      "commit": "426cbb8c232e11eb744edc713754ee5cd946c16e",
      "tree": "01b34156fbd9d96f1ed3d189f9c0878fdb492309",
      "parents": [
        "fd63851bba3d27353ae1949dfa45c292ed688de0"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Wed Apr 29 14:56:06 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed Apr 29 14:56:06 2026 +0800"
      },
      "message": "chore(deps): Pin AWS v1 SDK BOM to short-circuit transitive version-range walk (#18619)\n\namazon-kinesis-deaggregator (added in #18224) pulls aws-lambda-java-events\n1.1.0, whose POM declares aws-java-sdk-* deps with soft ranges like\n[1.10.5,). Maven resolves these by walking every published patch version,\nproducing hundreds of POM downloads per clean build. Importing aws-java-sdk-bom\nin dependencyManagement overrides the ranges with a single deterministic\nversion, eliminating the walk."
    },
    {
      "commit": "fd63851bba3d27353ae1949dfa45c292ed688de0",
      "tree": "b19fb59162b3b9be90b1b6a377ad4a795b91c89a",
      "parents": [
        "7c2c56ec3186de1db80fb7fbc4f0580984d4d130"
      ],
      "author": {
        "name": "Peter Huang",
        "email": "peter.huang@uber.com",
        "time": "Tue Apr 28 23:44:34 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed Apr 29 14:44:34 2026 +0800"
      },
      "message": "feat(flink): extend Flink quickstart example to use source v2 (#18518)"
    },
    {
      "commit": "7c2c56ec3186de1db80fb7fbc4f0580984d4d130",
      "tree": "38ba3092a1bec6b58fad9502f657ff0676c3d9b0",
      "parents": [
        "642e1d33e12073060a620788987c3ce4b7f84e36"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Wed Apr 29 12:19:17 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed Apr 29 12:19:17 2026 +0800"
      },
      "message": "fix(schema): Handle BLOB and VARIANT in Hive-reader rewriteRecordWithNewSchema (#18580)\n\nFixes issue: #18578\n\nHoodieArrayWritableSchemaUtils.rewriteRecordWithNewSchemaInternal switches\non newSchema.getType() and only named RECORD/ENUM/ARRAY/MAP/UNION. BLOB\n(#18108) and VARIANT (#17833) are Hudi logical types physically stored as\nAvro records but exposed as distinct HoodieSchemaTypes, so a new schema\ntyped BLOB/VARIANT fell through to rewritePrimaryType and threw\n\"cannot support rewrite value for schema type\".\n\nThis reproduces on the Hive read path whenever Hive projects from its\nHMS-derived struct shape (record name \u003d column name, type field \u003d plain\nSTRING) onto Hudi\u0027s canonical BLOB schema (record \"blob\", type \u003d ENUM\nblob_storage_type, logicalType \"blob\") - the exact signature seen in\nITTestCustomTypeHiveSync#testBlobTypeWithHiveSyncSQL. VECTOR was fine by\naccident because it maps to Avro FIXED.\n\nAdd case BLOB and case VARIANT fallthrough to the existing RECORD body.\nInner field layouts are fixed by BlobLogicalType.validate /\nVariantLogicalType.validate, so field-by-name iteration is correct. The\nexisting ENUM case at line 137 already handles the STRING -\u003e ENUM\nconversion for the BLOB \"type\" field.\n\nTests pin the fix without Spark / Hive / Testcontainers - they call\nHoodieArrayWritableSchemaUtils.rewriteRecordWithNewSchema directly with\nsynthetic schemas that mirror the E2E failure signature, for both BLOB\nand VARIANT."
    },
    {
      "commit": "642e1d33e12073060a620788987c3ce4b7f84e36",
      "tree": "17c8d5822cfd12999d387fba59b885e4dc6df370",
      "parents": [
        "eed9aa7f0639031e4d6b21c87c67bad2a89cafb2"
      ],
      "author": {
        "name": "Y Ethan Guo",
        "email": "ethan.guoyihua@gmail.com",
        "time": "Mon Apr 27 14:41:08 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon Apr 27 14:41:08 2026 -0700"
      },
      "message": "chore(release): Moving to 1.3.0-SNAPSHOT on master branch (#18620)"
    },
    {
      "commit": "eed9aa7f0639031e4d6b21c87c67bad2a89cafb2",
      "tree": "d8ffe73f92d44085e3428015480d8f2ec9cf7e83",
      "parents": [
        "1ededfd70605d9421e1117728303dd49af71257c"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Tue Apr 28 05:00:23 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon Apr 27 14:00:23 2026 -0700"
      },
      "message": "feat: Add variant support description to RFC-99 (#18274)"
    },
    {
      "commit": "1ededfd70605d9421e1117728303dd49af71257c",
      "tree": "f9b393c28f9ab1369053122456b18dcc2a7fbec8",
      "parents": [
        "5c73bc0da20deaa1f4c74786924bc62fadf93892"
      ],
      "author": {
        "name": "Krishen",
        "email": "22875197+kbuci@users.noreply.github.com",
        "time": "Mon Apr 27 13:52:21 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon Apr 27 13:52:21 2026 -0700"
      },
      "message": "feat(common): When inferring checkpoint/schema from timeline, check non-ingestion write commits (in case they have metadata rolled-over) (#18576)\n\nWhen archival removes all ingestion commits from the active timeline, code paths that infer schema or checkpoint metadata can fail because they only inspect ingestion-type instants (commits whose WriteOperationType.canUpdateSchema() is true). With Hudi\u0027s rolling metadata feature (hoodie.write.rolling.metadata.keys), non-ingestion commits like clustering, compaction, and delete_partition can carry rolled-over schema and checkpoint metadata. However, several inference paths don\u0027t search these commit types. This PR ensures schema and checkpoint resolution falls back to non-ingestion write commits when the latest instant doesn\u0027t carry the needed metadata.\n\nSummary and Changelog\nChanges:\n\nHoodieActiveTimeline / ActiveTimelineV1 / ActiveTimelineV2: Added a boolean filterByCanUpdateSchema overload to getLastCommitMetadataWithValidSchema. When false, the canUpdateSchema filter is skipped, allowing schema discovery from any commit type (clustering, compaction, delete_partition). The no-arg version retains the original behavior (filter enabled).\nTableSchemaResolver: Changed getLatestCommitMetadataWithValidSchema() to call getLastCommitMetadataWithValidSchema(false), so schema resolution searches all completed commit types instead of only ingestion commits.\nBaseHoodieClient: In mergeRollingMetadata, empty-string values are now treated as \"missing\" when checking both the current commit\u0027s existing metadata and values found in prior commits. This prevents an empty string from short-circuiting the walkback.\nInitialCheckpointFromAnotherHoodieTimelineProvider: Switched from getCommitsTimeline() to getWriteTimeline() to include compaction/logcompaction instants. Filters out empty checkpoint strings (not just nulls). Re-throws IOException as HoodieIOException instead of swallowing it.\nTests: Added 2 unit tests in TestTimelineUtils (schema lookup ignoring operation type, empty schema returns empty) and 1 functional test in TestHoodieClientOnCopyOnWriteStorage (rolling metadata preserved across clustering after archival, with TableSchemaResolver still able to find schema).\n\n\n---------\n\nCo-authored-by: Krishen Bhan \u003c“bkrishen@uber.com”\u003e"
    },
    {
      "commit": "5c73bc0da20deaa1f4c74786924bc62fadf93892",
      "tree": "e36fea0b851114859021b96471d4cdd0e913959d",
      "parents": [
        "4bdcdf9b52b5b027d5e215aeaa30be71ac85e72c"
      ],
      "author": {
        "name": "Rahil C",
        "email": "32500120+rahil-c@users.noreply.github.com",
        "time": "Mon Apr 27 13:14:47 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon Apr 27 13:14:47 2026 -0700"
      },
      "message": "feat(lance): fix lance writer/reader regarding arrow memory limit issue (#18613)"
    },
    {
      "commit": "4bdcdf9b52b5b027d5e215aeaa30be71ac85e72c",
      "tree": "d6d9ec94be585d5a4dc5fc452825052edaebd8c8",
      "parents": [
        "7ae0fd95b3a6d93ab6d1d6886392f0e43ca055f4"
      ],
      "author": {
        "name": "Lin Liu",
        "email": "141371752+linliu-code@users.noreply.github.com",
        "time": "Mon Apr 27 11:10:53 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon Apr 27 11:10:53 2026 -0700"
      },
      "message": "feat: Create JsonKinesisSource (#18224)\n\n* Add KinesisSource\n\n* Fix some issues\n\n* Add tests\n\n* Add more tests\n\n* handle spark parallelism\n\n* Add several critical features:\n\n1. Support aggregated records.\n2. Avoid expired shards blocking the stream.\n\n* Add more tests for these corner cases\n\n* Fix CI failures\n\n* Fix a performance issue\n\n* Filter empty shards before read`\n\n* Readability\n\n* Add more tests\n\n* Add more tests and a bug fix\n\n* Address some comments\n\n* Address some comments\n\n* Fixed some issues based on test in staging\n\n* Address more comments\n\n* iterator model for shard read\n\n* Refctor\n\n* test on data loss scenarios\n\n* Add last arrival time to checkpoint\n\n* Add smart retry\n\n* Address comments\n\n* refactor\n\n* Fix config naming\n\n* Fix CI OOM in test-common-and-other-modules build step\n\n- Change -DskipTests\u003dtrue to -Dmaven.test.skip\u003dtrue to skip test\n  compilation during full-reactor build, avoiding OOM with new\n  Kinesis dependencies\n- Add -Ddocker.skip\u003dtrue to prevent hudi-aws DynamoDB Local container\n  from starting during the build-only step\n\n* Fix Azure CI OOM in UT_FT_10 build step\n\nSame fix as GitHub Actions: add -Dmaven.test.skip\u003dtrue and\n-Ddocker.skip\u003dtrue to the full-reactor install in UT_FT_10 to\nskip test compilation and Docker plugin during the build-only step.\n\n* Fix docker skip property name: use -DskipDocker\u003dtrue\n\nThe parent POM configures docker-maven-plugin with \u003cskip\u003e${skipDocker}\u003c/skip\u003e,\nso the correct property is -DskipDocker\u003dtrue, not -Ddocker.skip\u003dtrue.\n\n* Reduce build parallelism in test-common-and-other-modules\n\nChange -T 2 to -T 1 in the build step to avoid OOM during compilation\nwhen multiple heavy modules (hudi-utilities, hudi-cli-bundle) compile\nin parallel under the 4GB heap limit.\n\n* Revert to -DskipTests with -T 1 for test-common build step\n\n-Dmaven.test.skip\u003dtrue breaks test-jar dependencies across modules.\nUse -DskipTests\u003dtrue (compiles tests, creates test-jars) with -T 1\n(sequential build) to stay within 4GB heap. Keep -DskipDocker\u003dtrue.\n\n* Increase heap to 6g for test-common build step\n\n4GB heap is insufficient for compiling hudi-utilities test sources\nwith new Kinesis dependencies. Increase to 6GB and restore -T 2\nparallelism. GitHub Actions runners have 7GB RAM.\n\n* Use -T 1 with 6g heap for test-common build step\n\n6GB + -T 2 still OOMs during hudi-utilities testCompile when another\nmodule compiles concurrently. Sequential build (-T 1) ensures\nhudi-utilities gets the full 6GB heap for its test compilation.\n\n* Fork compiler JVM for test-common build step\n\nThe main Maven JVM accumulates memory across 50+ module compilations,\ncausing OOM when it reaches hudi-utilities testCompile. 
Fork a separate\nJVM for each module\u0027s compilation so it starts fresh with up to 4GB.\nRestore -T 2 and -Xmx4g for the main Maven process.\n\n* Disable 2 failing TestJsonKinesisSource tests\n\n- testRecordToJsonInvalidJsonWithShouldAddOffsetsReturnsOriginalString:\n  expects fallback but code throws HoodieException for invalid JSON\n- testCreateCheckpointLocalStackSentinelReplacedWithLastSeq:\n  sentinel replacement not yet implemented\nBoth flagged in PR review for follow-up fixes.\n\n* Disable e2e Kinesis tests and revert CI config changes\n\n- Revert bot.yml and azure-pipelines to match master\n- Remove LocalStack/Testcontainers test infrastructure\n  (KinesisTestUtils, LocalStackJsonKinesisSource) that caused OOM\n  during full-reactor test compilation in CI\n- Remove testcontainers:localstack and amazon-kinesis-aggregator\n  dependencies from hudi-utilities\n- @Disabled the TestKinesisSource nested class in TestHoodieDeltaStreamer\n- Keep all unit tests (TestJsonKinesisSource, TestKinesisCheckpointUtils, etc.)\n- E2e tests to be re-enabled in follow-up with proper CI support\n\n* Remove e2e Kinesis tests and deaggregator test that depend on deleted files\n\nRemove TestKinesisSource nested class from TestHoodieDeltaStreamer\n(references deleted KinesisTestUtils/LocalStackJsonKinesisSource).\nRemove TestKinesisDeaggregator (depends on removed amazon-kinesis-aggregator).\n\n* Fork compiler in test-common build step to avoid OOM\n\nThe full-reactor build accumulates JVM memory across 50+ modules,\ncausing OOM when compiling hudi-utilities test sources with new\nKinesis dependencies. Fork a separate JVM per module compilation\n(-Dmaven.compiler.fork\u003dtrue -Dmaven.compiler.maxmem\u003d4096m) so each\ngets a fresh heap.\n\n---------\n\nCo-authored-by: Y Ethan Guo \u003cethan.guoyihua@gmail.com\u003e"
    },
    {
      "commit": "7ae0fd95b3a6d93ab6d1d6886392f0e43ca055f4",
      "tree": "88628e519d0c32f294f3860a4422f695aca1beeb",
      "parents": [
        "782552a5e02497f31cca78a65bdb5fb1049297a2"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Mon Apr 27 15:48:45 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon Apr 27 00:48:45 2026 -0700"
      },
      "message": "fix(schema): Allow nested projection on BLOB and VARIANT columns in pruneDataSchema (#18566)"
    },
    {
      "commit": "782552a5e02497f31cca78a65bdb5fb1049297a2",
      "tree": "1f453a5995c1f261004fcc5dc03a5b32fc72a193",
      "parents": [
        "20a01051ca6d6ce51039089a7b7a2ff836378783"
      ],
      "author": {
        "name": "yuqi",
        "email": "ychris7899@gmail.com",
        "time": "Mon Apr 27 15:01:24 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon Apr 27 00:01:24 2026 -0700"
      },
      "message": "fix: Curator class conflict in ZookeeperBasedLockProvider (#18593)\n\nCo-authored-by: yuqi \u003cyuqi@bestpay.com.cn\u003e"
    },
    {
      "commit": "20a01051ca6d6ce51039089a7b7a2ff836378783",
      "tree": "d9a81258aad6796f2889c6f09bf6365d9cd85543",
      "parents": [
        "fdf27db0733edb1d894b8e1f389ce9afb7e6907f"
      ],
      "author": {
        "name": "Y Ethan Guo",
        "email": "ethan.guoyihua@gmail.com",
        "time": "Mon Apr 27 00:00:39 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon Apr 27 00:00:39 2026 -0700"
      },
      "message": "feat(spark): add Spark 4.1 support (#17674)"
    },
    {
      "commit": "fdf27db0733edb1d894b8e1f389ce9afb7e6907f",
      "tree": "3d59d4ed0f7eee1258eb0e635d93b85e6b49690c",
      "parents": [
        "a8a6917ef5a4546ab89874ee32ab91af5452feda"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Mon Apr 27 14:14:30 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Sun Apr 26 23:14:30 2026 -0700"
      },
      "message": "fix(vector): Preserve VECTOR/BLOB metadata on SQL INSERT path (#18540)"
    },
    {
      "commit": "a8a6917ef5a4546ab89874ee32ab91af5452feda",
      "tree": "2ff6d3b422cf4e0c4fda1ea1f860e291d36006ff",
      "parents": [
        "29f9c4034c07080fe651bac5586bdee551dd006f"
      ],
      "author": {
        "name": "Shihuan Liu",
        "email": "skywalker0618@gmail.com",
        "time": "Sun Apr 26 19:53:49 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon Apr 27 10:53:49 2026 +0800"
      },
      "message": "feat(flink): Vendor Flink 2.1 Dremel nested-reader support classes (#18567)"
    },
    {
      "commit": "29f9c4034c07080fe651bac5586bdee551dd006f",
      "tree": "c5ed772acd3d992da8ee406c0df4a27b88a684fb",
      "parents": [
        "787953fa52dcbea8d326185e94fd1faa52e91e3b"
      ],
      "author": {
        "name": "Y Ethan Guo",
        "email": "ethan.guoyihua@gmail.com",
        "time": "Sat Apr 25 23:39:54 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Sat Apr 25 23:39:54 2026 -0700"
      },
      "message": "feat(ci): enable auto-merge and require all GitHub Actions checks on master (#18594)\n\nEnable auto-merge support and add required status checks for the master\nbranch protection to enforce all GitHub Actions CI checks pass before merge."
    },
    {
      "commit": "787953fa52dcbea8d326185e94fd1faa52e91e3b",
      "tree": "50a27a81c3ed7911613225e76047f8ba8ed7d94d",
      "parents": [
        "853064454656ba6aa74bc28c57a1872bd5b81c99"
      ],
      "author": {
        "name": "Rahil C",
        "email": "32500120+rahil-c@users.noreply.github.com",
        "time": "Sat Apr 25 23:03:27 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Sat Apr 25 23:03:27 2026 -0700"
      },
      "message": "feat(blob): add support for lance blob inline descriptor reading (#18586)\n\nCo-authored-by: Y Ethan Guo \u003cethan.guoyihua@gmail.com\u003e"
    },
    {
      "commit": "853064454656ba6aa74bc28c57a1872bd5b81c99",
      "tree": "333cbd86fdfec12b581f75144cd5aedfdbfe40cd",
      "parents": [
        "4f3e885f75508e9e5c873aaa410ce2e515b22016"
      ],
      "author": {
        "name": "Rahil C",
        "email": "32500120+rahil-c@users.noreply.github.com",
        "time": "Sat Apr 25 15:45:41 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Sat Apr 25 15:45:41 2026 -0700"
      },
      "message": "feat(lance): support simplified path for lance blob inline reading (#18575)"
    },
    {
      "commit": "4f3e885f75508e9e5c873aaa410ce2e515b22016",
      "tree": "f55faa9c8139ad51842cafa11dd788287141999d",
      "parents": [
        "c1569db7f7353885b485dc480ccb6ef0f480b3f6"
      ],
      "author": {
        "name": "Y Ethan Guo",
        "email": "ethan.guoyihua@gmail.com",
        "time": "Fri Apr 24 20:51:26 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri Apr 24 20:51:26 2026 -0700"
      },
      "message": "fix(ci): bump surefire test heap from 3g to 4g (#18589)\n\nIncrease -Xmx from 3g to 4g in the default, java11 and java17\nMaven profiles to prevent OOM failures during CI test runs."
    },
    {
      "commit": "c1569db7f7353885b485dc480ccb6ef0f480b3f6",
      "tree": "5226daa7d7c1343b020eafeb0790c45e731dfbfa",
      "parents": [
        "436bd66834a5e86440bc4147bc35b003af9436ad"
      ],
      "author": {
        "name": "Y Ethan Guo",
        "email": "ethan.guoyihua@gmail.com",
        "time": "Fri Apr 24 18:22:03 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri Apr 24 18:22:03 2026 -0700"
      },
      "message": "fix(clean): address review comments on empty clean support (#18587)"
    },
    {
      "commit": "436bd66834a5e86440bc4147bc35b003af9436ad",
      "tree": "ed9428dd80e2b4b0d18f365a52b63098e2d60778",
      "parents": [
        "2059c112b9af5e634c6dfbab5560c786e83f0c90"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Sat Apr 25 06:56:53 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri Apr 24 15:56:53 2026 -0700"
      },
      "message": "fix(vector): Pass plain FIXED through to VECTOR projection on Hive read (#18582)\n\nFixes: #18579\n\nHudi writes VECTOR as bare FIXED_LEN_BYTE_ARRAY with no Parquet logical-type\nannotation, so Hive\u0027s Parquet reader reconstructs the Avro schema as plain\nFIXED named after the column. Projecting that to the canonical VECTOR schema\nfailed with \"cannot support rewrite value for schema type\" because\nHoodieSchemaType.FIXED and VECTOR are distinct and rewritePrimaryTypeWithDiffSchemaType\nhad no VECTOR case. FIXED_BYTES-backed VECTOR is byte-identical to FIXED at\nmatching size, so pass the writable through."
    },
    {
      "commit": "2059c112b9af5e634c6dfbab5560c786e83f0c90",
      "tree": "282751ce14bae21b3877f42dd985090e651a16de",
      "parents": [
        "217e2a7a5400c234773e614afcdfc70624308b8a"
      ],
      "author": {
        "name": "Sivabalan Narayanan",
        "email": "n.siva.b@gmail.com",
        "time": "Fri Apr 24 10:27:09 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri Apr 24 10:27:09 2026 -0700"
      },
      "message": "feat(clean): Adding empty clean support to hudi (#18337)\n\nThis PR adds support for creating empty clean commits to optimize clean planning performance for append-only datasets.\n\nProblem: In datasets with incremental cleaning enabled that receive infrequent updates or are primarily append-only, the clean planner performs a full table scan on every ingestion run because there are\nno clean plans to mark progress. This leads to significant performance overhead, especially for large tables.\n\nSolution: Introduce a new configuration hoodie.write.empty.clean.internval.hours that allows creating empty clean commits after a configurable duration. These empty clean commits update the\nearliestCommitToRetain value, enabling subsequent clean planning operations to only scan partitions modified after the last empty clean, avoiding expensive full table scans.\n\nSummary and Changelog\nUser-facing changes:\nNew advanced config hoodie.write.empty.clean.create.duration.ms (default: -1, disabled) to control when empty clean commits should be created\nWhen enabled with incremental cleaning, Hudi will create empty clean commits after the specified duration (in milliseconds) to optimize clean planning performance\n\nDetailed changes:\n\nConfig Addition (HoodieCleanConfig.java):\n- Addedhoodie.write.empty.clean.internval.hours config property with builder method\n- Marked as advanced config for power users\nClean Execution (CleanActionExecutor.java):\n- Modified clean parallelism calculation to ensure minimum of 1 (was causing issues with empty plans)\n- Added createEmptyCleanMetadata() method to construct metadata for empty cleans\n- Updated runClean() to handle empty clean stats by creating appropriate metadata\nClean Planning (CleanPlanActionExecutor.java):\n- Added getEmptyCleanerPlan() method to construct cleaner plans with no files to delete\n- Modified requestClean() to return empty plans when partitions list is empty\n- Added logic in requestCleanInternal() to check if empty clean commit should be created based on:\nIncremental cleaning enabled\nTime since last clean \u003e configured threshold\nValid earliestInstantToRetain present\n\nImpact\nPerformance Impact: Positive - significantly reduces clean planning time for append-only or infrequently updated datasets by avoiding full table scans\n\nAPI Changes: None - purely additive configuration\n\nBehavior Changes:\n\nWhen enabled, users will see empty clean commits in the timeline at the configured intervals\nThese commits have totalFilesDeleted\u003d0 and empty partition metadata but contain valid earliestCommitToRetain metadata"
    },
    {
      "commit": "217e2a7a5400c234773e614afcdfc70624308b8a",
      "tree": "54df3529552da9a14e27dd1d90c6a705a28a5c08",
      "parents": [
        "edaa16820b7b103a7342506c196da1754d0fdfe2"
      ],
      "author": {
        "name": "Sivabalan Narayanan",
        "email": "n.siva.b@gmail.com",
        "time": "Fri Apr 24 10:20:43 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri Apr 24 10:20:43 2026 -0700"
      },
      "message": "feat: Adding support to inject custom configs to parquet writer (#18379)\n\nThis PR adds support for custom Parquet configuration injection across all file writer factories in Apache Hudi. This feature allows users to inject custom Parquet configurations (e.g., native Parquet bloom filters, custom compression settings, dictionary encoding overrides) at runtime without modifying Hudi\u0027s core code.\n\nMotivation: Users sometimes need to apply specific Parquet configurations for certain tables or partitions (e.g., disable dictionary encoding for high-cardinality columns, enable native Parquet bloom filters for specific columns, or apply custom encoding strategies). Previously, these configurations were hard-coded or required code changes. This PR introduces a pluggable mechanism via the HoodieParquetConfigInjector interface.\n\nSummary and Changelog\nSummary: Added support for custom Parquet configuration injection across Spark, Avro, and Flink file writers. Users can now implement\nthe HoodieParquetConfigInjector interface and specify it via the hoodie.parquet.config.injector.class configuration to inject custom\nParquet settings at write time.\n\nChanges:\n\nCore Implementation (hudi-client-common):\n- Added HoodieParquetConfigInjector interface with withProps() method that accepts StoragePath, StorageConfiguration, and HoodieConfig\nand returns modified configurations\nSpark Integration (hudi-spark-client):\n- Modified HoodieSparkFileWriterFactory.newParquetFileWriter() to check for and invoke config injector (lines 66-79)\n- Added comprehensive tests in TestHoodieParquetConfigInjector:\ntestDisableDictionaryEncodingViaInjector() - validates dictionary encoding can be disabled\ntestInvalidInjectorClassThrowsException() - validates error handling\ntestNoInjectorUsesDefaultConfig() - validates backward compatibility\n- Tests validate actual Parquet metadata (encodings) rather than just configuration\nAvro Integration (hudi-hadoop-common):\n- Modified HoodieAvroFileWriterFactory.newParquetFileWriter() to support config injection (lines 71-85)\n- Updated getHoodieAvroWriteSupport() signature to accept StorageConfiguration\n- Added TestHoodieAvroParquetConfigInjector with similar test coverage\nFlink Integration (hudi-flink-client):\n- Modified HoodieRowDataFileWriterFactory.newParquetFileWriter() to support config injection (lines 126-140)\n- Added TestHoodieRowDataParquetConfigInjector with similar test coverage\nConfiguration:\n- Added HOODIE_PARQUET_CONFIG_INJECTOR_CLASS config key in HoodieStorageConfig\n- Added withParquetConfigInjectorClass() builder method\n\nImpact\nPublic API:\n\nNew interface: HoodieParquetConfigInjector (marked with appropriate annotations)\nNew configuration: hoodie.parquet.config.injector.class (optional, defaults to empty string)\n\nUser-facing changes:\nUsers can now customize Parquet settings per file/partition without modifying Hudi code\nFully backward compatible - existing code continues to work without changes\nNo performance impact when feature is not used (single isNullOrEmpty() check)"
    },
    {
      "commit": "edaa16820b7b103a7342506c196da1754d0fdfe2",
      "tree": "bda9c92c9a66a8916077e2ce0be4678f013e15f1",
      "parents": [
        "2092890af2925c6db5e4ef458f30e3608931ff71"
      ],
      "author": {
        "name": "tiennguyen-onehouse",
        "email": "tien@onehouse.ai",
        "time": "Thu Apr 23 20:11:22 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri Apr 24 11:11:22 2026 +0800"
      },
      "message": "fix: FileGroupReader drops mandatory partition columns from dataSchema (#18570)\n\nHoodieFileGroupReaderBasedFileFormat.buildReaderWithPartitionValues builds\ntwo schemas side-by-side: requestedSchema (what to return to Spark) and\ndataSchema (what to read from parquet). It augments requestedSchema with\nany partition fields in mandatoryFields before pruning, but pipes\ndataStructType through unchanged. Spark\u0027s dataStructType excludes\npartition columns by convention, and HoodieSchemaUtils.pruneDataSchema\niterates over its second arg, so any mandatory partition field is\nsilently dropped from the resulting dataSchema. The FileGroupReader then\ndoes not read the column from the parquet base file, and for\nnon-projection-compatible CUSTOM mergers (e.g. PostgresDebeziumAvroPayload)\nthe output converter writes null for every affected row.\n\nMost visible on MOR file slices that have both a base file and a log\nfile, since the readBaseFile path (which would append partition values\nfrom the directory name) is skipped in favor of the FileGroupReader path.\n\nRegression introduced by #13711 (\"Improve Logical Type Handling on Col\nStats\"), which added the pruneDataSchema wrapping but only on the\nrequested-schema side.\n\nFix: mirror requestedStructType\u0027s construction — augment dataStructType\nwith the mandatory partition fields before pruning.\n\nAlso adds a regression test (TestFileGroupReaderPartitionColumn) that\nreproduces the scenario end-to-end: MOR + CustomKeyGenerator +\nPostgresDebeziumAvroPayload + GLOBAL_SIMPLE with\nupdate.partition.path\u003dtrue, round-2 partition-key change producing a\nbase+log slice, then verifies untouched records in that slice read back\nwith the correct partition-column value.\n\nFixes: #18568\n\nSigned-off-by: tiennguyen-onehouse \u003ctien@onehouse.ai\u003e"
    },
    {
      "commit": "2092890af2925c6db5e4ef458f30e3608931ff71",
      "tree": "7b0c8f035810bbc11844ac10c0a48ac284bc8046",
      "parents": [
        "110b9be76634f31f71f71b3b5265d9180a46b049"
      ],
      "author": {
        "name": "tiennguyen-onehouse",
        "email": "tien@onehouse.ai",
        "time": "Thu Apr 23 19:47:51 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri Apr 24 10:47:51 2026 +0800"
      },
      "message": "fix: ProtoConversionUtil$AvroSupport static init under Avro 1.12 (#18571)\n\nRECURSION_OVERFLOW_SCHEMA passes new byte[0] (via getUTF8Bytes(\"\")) as\nthe default value of a BYTES HoodieSchemaField, which wraps an Avro\nSchema.Field. Avro 1.12.0\u0027s Schema.validateDefault rejects byte[] for\nBYTES defaults — it now requires a String (interpreted as ISO-8859-1\nbytes, Avro\u0027s canonical JSON form for BYTES defaults). Under 1.11.x the\nvalidator was lenient.\n\nBecause the failure is in a static initializer, the JVM caches the\nExceptionInInitializerError and every subsequent class access throws\nNoClassDefFoundError: Could not initialize class\norg.apache.hudi.utilities.sources.helpers.ProtoConversionUtil$AvroSupport.\nAny downstream service loading this class (e.g. DeltaStreamer using\nProtoClassBasedSchemaProvider) is permanently broken on JVMs that have\nAvro 1.12+ on the classpath, even though Hudi itself still pins 1.11.4.\n\nFix: use \"\" (empty string) instead of getUTF8Bytes(\"\"). The default\nvalue is never read at runtime — it is always overwritten by\nByteBuffer.wrap(messageValue.toByteArray()) at line ~365 before the\nfield is populated — and \"\" serializes bit-for-bit identically to\nnew byte[0] on the wire (both produce \"default\":\"\" in the JSON schema),\nso behavior is strictly preserved under Avro 1.11 while satisfying\n1.12\u0027s stricter validator.\n\nRejected alternatives (experimental testing against both Avro versions):\n| default arg              | 1.11.x | 1.12.0 |\n| new byte[0] (current)    | OK     | FAIL   |\n| \"\" (fix)                 | OK     | OK     |\n| ByteBuffer.wrap(...)     | OK     | FAIL   |\n| null                     | OK     | OK, but strips the default entirely |\n\nAlso drops the now-unused getUTF8Bytes static import.\n\nFixes: #18569\n\nSigned-off-by: tiennguyen-onehouse \u003ctien@onehouse.ai\u003e"
    },
    {
      "commit": "110b9be76634f31f71f71b3b5265d9180a46b049",
      "tree": "882f2f9b57113a8f08052e04173ae0c0df6c6d1c",
      "parents": [
        "9d1f81744bd716817b2aa83430b63a0f69f3e4c1"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Fri Apr 24 10:18:43 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 23 19:18:43 2026 -0700"
      },
      "message": "fix(variant): allow VariantType writes through Hudi\u0027s V1 DataSource on Spark 4 (#18564)\n\nCo-authored-by: Y Ethan Guo \u003cethan.guoyihua@gmail.com\u003e"
    },
    {
      "commit": "9d1f81744bd716817b2aa83430b63a0f69f3e4c1",
      "tree": "c4f99ff5acbfd865fd75c7bce96f338a12866afc",
      "parents": [
        "1e646621effebd02906256a9ba3e363b4ad769c3"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Fri Apr 24 10:18:17 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 23 19:18:17 2026 -0700"
      },
      "message": "fix(vector): Register VECTOR HMS column as BINARY on Spark CREATE (#18545)\n\nCo-authored-by: Y Ethan Guo \u003cethan.guoyihua@gmail.com\u003e"
    },
    {
      "commit": "1e646621effebd02906256a9ba3e363b4ad769c3",
      "tree": "931c6cce911466c5a1bb49844494bd3117ddd0a3",
      "parents": [
        "ace2871c3718d17038fcb555928b002117fcdc4a"
      ],
      "author": {
        "name": "Rahil C",
        "email": "32500120+rahil-c@users.noreply.github.com",
        "time": "Thu Apr 23 16:18:54 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 23 16:18:54 2026 -0700"
      },
      "message": "feat(lance): round-trip Hudi VECTOR columns as native Lance fixed-size lists (#18497)"
    },
    {
      "commit": "ace2871c3718d17038fcb555928b002117fcdc4a",
      "tree": "e1c680c8c266b56f77d7c1d18982a8a5d29021c5",
      "parents": [
        "7f4dd319616504e3973c68c29b042989fa03e4ad"
      ],
      "author": {
        "name": "Shuo Cheng",
        "email": "njucshuo@gmail.com",
        "time": "Thu Apr 23 17:26:51 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 23 17:26:51 2026 +0800"
      },
      "message": "feat(flink): Introduces dictionary encoding of payload partition path for RocksDBIndexBackend (#18560)"
    },
    {
      "commit": "7f4dd319616504e3973c68c29b042989fa03e4ad",
      "tree": "0d73cad384d10b19676dea90588ad5849ff10c93",
      "parents": [
        "ddbdbb944a2178ab7bc2c4c746c12c119810833d"
      ],
      "author": {
        "name": "Rahil C",
        "email": "32500120+rahil-c@users.noreply.github.com",
        "time": "Thu Apr 23 01:36:15 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 23 01:36:15 2026 -0700"
      },
      "message": "fix(lance): Add Hive InputFormat stubs and fix Spark SQL for Lance file format (#18162)\n\nCo-authored-by: Y Ethan Guo \u003cethan.guoyihua@gmail.com\u003e"
    },
    {
      "commit": "ddbdbb944a2178ab7bc2c4c746c12c119810833d",
      "tree": "cf9a818641929132bfc8458417b8962fb6f74dfb",
      "parents": [
        "86238986b36b7f5c2cb06167f956596ffa52a69e"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Thu Apr 23 10:24:16 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed Apr 22 19:24:16 2026 -0700"
      },
      "message": "chore(spark): bump spark4.version to 4.0.2 (#18549)"
    },
    {
      "commit": "86238986b36b7f5c2cb06167f956596ffa52a69e",
      "tree": "8f25b959cd210c9e51240e446829925eeafb9c5e",
      "parents": [
        "4260914c8265fe3da51f127880a2760b8d9be4e4"
      ],
      "author": {
        "name": "Venkateswarlu Boggavarapu",
        "email": "mailtoboggavarapu@gmail.com",
        "time": "Wed Apr 22 22:00:42 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 23 10:00:42 2026 +0800"
      },
      "message": "fix: JDBC connection leak in HiveIncrementalPuller.saveDelta() (#18460)"
    },
    {
      "commit": "4260914c8265fe3da51f127880a2760b8d9be4e4",
      "tree": "9cdddb545e8fe69834b3b1e3d6ca3f307aeb0e68",
      "parents": [
        "cd83cf4d40026d50dcf6859ac20ec15dc6e69e18"
      ],
      "author": {
        "name": "Shihuan Liu",
        "email": "skywalker0618@gmail.com",
        "time": "Wed Apr 22 18:59:10 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 23 09:59:10 2026 +0800"
      },
      "message": "fix: Parquet small-precision decimals decode ClassCastException (#18552)\n\n* fix: Parquet small-precision decimals decode ClassCastException\n\nParquet encodes DECIMAL physically as INT32 (precision \u003c\u003d 9), INT64\n(precision \u003c\u003d 18), or BINARY / FIXED_LEN_BYTE_ARRAY. The Hudi-Flink 1.18\nParquetDecimalVector predated this and unconditionally cast the child\nvector to BytesColumnVector inside getDecimal(), throwing\nClassCastException whenever the reader materialized a small-precision\ndecimal as a HeapIntVector / HeapLongVector.\n\nSync ParquetDecimalVector with Apache Flink 2.1\u0027s implementation:\n - getDecimal dispatches on the physical child vector type\n   (ParquetSchemaConverter#is32BitDecimal / is64BitDecimal) and decodes\n   int / long / bytes accordingly.\n - Wrapper additionally implements WritableLongVector,\n   WritableIntVector, and WritableBytesVector by delegation so that\n   column readers can write through it without unwrapping.\n - Underlying vector field is now private; accessed via getVector().\n   Migrated ArrayColumnReader\u0027s nine direct-field accesses.\n - Adds TestParquetDecimalVector covering every dispatch path, the\n   backward-compatible bytes-at-small-precision case, unsupported\n   vector types, null handling, and the Writable* delegation contracts."
    },
    {
      "commit": "cd83cf4d40026d50dcf6859ac20ec15dc6e69e18",
      "tree": "6c8118d2491799b35cf0627538825f270c455be0",
      "parents": [
        "4ef56e4ebd79ccaa27f5b86d5cab01eea06e46c5"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Thu Apr 23 07:56:03 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed Apr 22 16:56:03 2026 -0700"
      },
      "message": "chore(docker): add Hadoop 3.4.0 / Hive 2.3.10 / Spark 4.0.2 compose stack (#18550)"
    },
    {
      "commit": "4ef56e4ebd79ccaa27f5b86d5cab01eea06e46c5",
      "tree": "c181a4e33c0bba5f4a02a87a6d6be97f419cb2ad",
      "parents": [
        "e4904ba38f282dfb3f12b6971f1e1911c15443c2"
      ],
      "author": {
        "name": "Rahil C",
        "email": "32500120+rahil-c@users.noreply.github.com",
        "time": "Wed Apr 22 16:32:36 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed Apr 22 16:32:36 2026 -0700"
      },
      "message": "feat(blob): followup fixes for blob reader (#18538)"
    },
    {
      "commit": "e4904ba38f282dfb3f12b6971f1e1911c15443c2",
      "tree": "4caabca4d9ab03735158c9c76aafe742851c706c",
      "parents": [
        "0d5743510b679e0c759dffaf52eb8d11d209bf69"
      ],
      "author": {
        "name": "Lokesh Jain",
        "email": "ljain@apache.org",
        "time": "Wed Apr 22 20:50:18 2026 +0530"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed Apr 22 08:20:18 2026 -0700"
      },
      "message": "feat: Add support for exclusive rollbacks with multi writer (#18448)\n\nCloses #18050\n\nIn multi-writer mode, when multiple writers detect a failed inflight commit, each writer independently schedules and executes its own rollback. This leads to duplicate rollback instants on the timeline for the same failed commit — causing unnecessary work, potential conflicts, and timeline clutter.\n\nSummary and Changelog\nAdds a mechanism to avoid duplicate rollback plans in multi-writer mode by:\n\nTimeline reload under lock: Before scheduling a new rollback, the writer reloads the active timeline inside the lock to check if another writer already scheduled a rollback for the same failed commit. If found, it reuses the existing plan instead of creating a new one.\nHeartbeat-based ownership: Before executing a rollback, the writer acquires a heartbeat for the rollback instant under a transaction lock. If another writer already holds an active heartbeat (i.e., is currently executing the rollback), the current writer skips execution. If the heartbeat is expired, the writer takes ownership and proceeds.\nCompleted rollback detection: Inside the heartbeat acquisition lock, the writer also checks if the rollback was already completed on the timeline by another writer, and skips if so.\n\nChanges:\n\nBaseHoodieTableServiceClient: Refactored rollback() into schedule and execute phases. Extracted resolveOrScheduleRollback() (reuses existing pending rollbacks or schedules new ones under lock) and acquireRollbackHeartbeatIfMultiWriter() (heartbeat-based ownership with completed-rollback detection). Wrapped heartbeatClient.stop() in try-catch in the finally block to avoid masking rollback exceptions.\nHoodieWriteConfig: New advanced config hoodie.rollback.avoid.duplicate.plan (default false), gated on multi-writer mode via shouldAvoidDuplicateRollbackPlan().\nTestClientRollback: Added 6 tests covering: expired heartbeat takeover, active heartbeat skip, commit-not-in-timeline, first-writer-schedules-new-plan, already-completed-by-another-writer, and concurrent two-writer rollback of the same commit.\n\nImpact\nNew advanced config: hoodie.rollback.avoid.duplicate.plan — opt-in, default false, only effective in multi-writer mode.\nNo breaking changes to public APIs or storage format.\nNo behavioral change for single-writer mode or when the config is disabled.\n\n---------\n\nCo-authored-by: Lokesh Jain \u003cljain@Lokeshs-MacBook-Pro.local\u003e\nCo-authored-by: sivabalan \u003cn.siva.b@gmail.com\u003e"
    },
    {
      "commit": "0d5743510b679e0c759dffaf52eb8d11d209bf69",
      "tree": "6f69a8858e52ca111fafc8beb8c6b4e69c81661d",
      "parents": [
        "e30357927a5e71195c4d6a1eb6c9ba5c030c06a8"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Wed Apr 22 13:55:12 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed Apr 22 13:55:12 2026 +0800"
      },
      "message": "fix: VARIANT Hive sync error when performing CREATE table DDL (#18511)\n\n- Hive 2.x/3.x does not support VARIANT type natively.\n- When creating a Hudi table with VARIANT columns via SQL CREATE TABLE, Spark\u0027s HiveClient passes \"variant\" as a literal type string which Hive rejects.\n- Convert VariantType to struct\u003cvalue:binary, metadata:binary\u003e in the CatalogTable schema before passing to HiveClient, while preserving the original VariantType in table properties so Spark can reconstruct it when reading.\n- Includes unit test for the conversion.\n- Recursively convert VariantType inside nested StructType/ArrayType/MapType so columns like STRUCT\u003ca:VARIANT\u003e, ARRAY\u003cVARIANT\u003e, and MAP\u003cSTRING,VARIANT\u003e are also rewritten to the Hive-compatible physical struct.\n- Emit the variant struct with canonical (metadata, value) field order to match HoodieSchema.createVariant() and the Parquet/Iceberg convention.\n- Extract buildHiveCompatibleCatalogTable helper so the schema conversion and property merge are directly unit-testable.\n- Expand TestVariantDataType with nested-variant cases, canonical-order assertions, and coverage for buildHiveCompatibleCatalogTable.\n- Clean up Scaladoc (use backticks) and rename the test field from v to variant_col."
    },
    {
      "commit": "e30357927a5e71195c4d6a1eb6c9ba5c030c06a8",
      "tree": "54bacd6ff059a5952f345c6eb429a56785869537",
      "parents": [
        "f9dead0e57a8e1fa299e28d608be9b6769adb0e7"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Tue Apr 21 23:38:37 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Tue Apr 21 08:38:37 2026 -0700"
      },
      "message": "fix: prevent parseTypeDescriptor crash for VARIANT (#18510)\n\n* fix: prevent parseTypeDescriptor crash for non-custom logical types in schema conversion guards\n\n- The BLOB/VECTOR guard conditions in HoodieSparkSchemaConverters called parseTypeDescriptor() for any StructType/ArrayType with hudi_type metadata, which threw IllegalArgumentException for types like VARIANT that are not custom logical types.\n- Add isCustomLogicalTypeDescriptor() safe check to short-circuit the guards before parseTypeDescriptor() is called.\n- Add regression test that reproduces the struct+metadata VARIANT path.\n\n* fix: treat hudi_type\u003dVARIANT as a first-class custom logical type\n\n- Address review feedback on #18510, restructure the crash fix so a StructType tagged with hudi_type\u003dVARIANT is handled consistently with BLOB/VECTOR.\n- The hudi_type metadata is the deliberate escape hatch for engines without a native representation (notably Spark 3.5), so using it is itself as custom-logical-type signal.\n- Add VARIANT to CUSTOM_LOGICAL_TYPES and give it a case in parseTypeDescriptor, mirroring BLOB.\n- In HoodieSparkSchemaConverters, add a dedicated VARIANT pattern case that validates the expected unshredded structure ({metadata, value} binary fields) and produces HoodieSchema.Variant.\n- On Spark 4.0+ the column round-trips as native VariantType via the existing reverse conversion path.\n- Remove the isCustomLogicalTypeDescriptor short-circuit helper; with VARIANT now properly registered, the BLOB/VECTOR guards no longer need the pre-check.\n- Add unit tests for parseTypeDescriptor VARIANT (success, case insensitivity, parameter rejection) and integration tests asserting VARIANT promotion and malformed-struct rejection.\n\n* Address missing code coverage complains"
    },
    {
      "commit": "f9dead0e57a8e1fa299e28d608be9b6769adb0e7",
      "tree": "dd35d7f0b0b7c853416e32955eb06780c04e39bb",
      "parents": [
        "59fee58a71a264b546a7a159981d679259bd6aba"
      ],
      "author": {
        "name": "Sivabalan Narayanan",
        "email": "n.siva.b@gmail.com",
        "time": "Tue Apr 21 07:32:41 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Tue Apr 21 07:32:41 2026 -0700"
      },
      "message": "feat: Adding support to block archival on last known ECTR for v6 tables (#18380)\n\nThis PR adds support to block archival based on the Earliest Commit To Retain (ECTR) from the last completed clean operation, preventing potential data leaks when cleaning configurations change between clean and archival runs.\n\nProblem: Currently, archival recomputes ECTR independently based on cleaning configs at archival time, rather than reading it from the last clean plan. When cleaning configs change between clean and archival operations, archival may archive commits whose data files haven\u0027t been cleaned yet, leading to timeline metadata loss for existing data files.\n\nSummary and Changelog\nUser-facing summary: Users can now optionally enable archival blocking based on ECTR from the last clean to prevent archiving commits whose data files haven\u0027t been cleaned. This is useful when cleaning configurations may change over time or when strict data retention guarantees are needed.\n\nDetailed changelog:\n\nConfiguration Changes:\n\nAdded new advanced config hoodie.archive.block.on.latest.clean.ectr (default: false)\nWhen enabled, archival reads ECTR from last completed clean metadata\nBlocks archival of commits with timestamp \u003e\u003d ECTR\nMarked as advanced config for power users\nAvailable since version 1.2.0\n\nImplementation Changes:\n\nTimelineArchiverV1.java: Added ECTR blocking logic in getCommitInstantsToArchive() method\nReads ECTR from last completed clean\u0027s metadata (lines 274-294)\nFilters commit timeline to exclude commits \u003e\u003d ECTR (lines 322-326)\nFollows same pattern as existing compaction/clustering retention checks\nIncludes error handling with graceful degradation (logs warning if metadata read fails)\nHoodieArchivalConfig.java: Added config property BLOCK_ARCHIVAL_ON_LATEST_CLEAN_ECTR\nBuilder method: withBlockArchivalOnCleanECTR(boolean)\nHoodieWriteConfig.java: Added access method shouldBlockArchivalOnCleanECTR()"
    },
    {
      "commit": "59fee58a71a264b546a7a159981d679259bd6aba",
      "tree": "655494b720c99c5f43da59456321d6fd8bdaa9df",
      "parents": [
        "a83473689eaee9b63211bd07bcbfb87b50304c08"
      ],
      "author": {
        "name": "Rahil C",
        "email": "32500120+rahil-c@users.noreply.github.com",
        "time": "Tue Apr 21 05:46:05 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Tue Apr 21 20:46:05 2026 +0800"
      },
      "message": "feat(lance): Bump lance to 4.0.0 and lance-spark to 0.4.0 (#18498)\n\n* [MINOR] Bump lance to 4.0.0 and lance-spark to 0.4.0\n\nBumps lance-core from 1.0.2 to 4.0.0 and lance-spark connector\nfrom 0.0.15 to 0.4.0. Updates affected import paths and adapts to\nthe LanceArrowUtils.toArrowSchema signature change (drops the\nerrorOnDuplicatedFieldNames parameter).\n\n* [MINOR] Rename Hudi\u0027s ShowIndexes logical plan to HoodieShowIndexes\n\nLance-spark 0.4.0 (bumped in 7e4967cd2822) ships its own\n`org.apache.spark.sql.catalyst.plans.logical.ShowIndexes` inside\n`lance-spark-base_*.jar`. This collides with Hudi\u0027s own same-FQCN\ncase class (added in hudi-spark-common). Both jars end up on the\nclasspath of hudi-spark3.3.x/3.4.x/3.5.x/4.0.x, and since the two\nclasses have different case-class arity (Lance\u0027s is 1-arg, Hudi\u0027s\nis 2-arg), Scala pattern matches like `case ShowIndexes(table, output)`\nfail to compile.\n\nRename Hudi\u0027s class to `HoodieShowIndexes` (and its companion\nobject) to sidestep the collision. This is an internal logical-plan\nclass consumed only by Hudi\u0027s own parser / CatalystPlanUtils /\nanalyzer — no public SQL or API surface changes.\n\nCall-sites updated:\n- Index.scala (definition + companion)\n- HoodieSpark{33,34,35,40}CatalystPlanUtils.scala (pattern match)\n- HoodieSpark{3_3,3_4,3_5,4_0}ExtendedSqlAstBuilder.scala (construct)\n- IndexCommands.scala (doc reference)\n\nCo-Authored-By: Claude Opus 4.6 \u003cnoreply@anthropic.com\u003e\n\n* fix(lance): look up Arrow vectors by field name in LanceRecordIterator\n\nWith lance-spark 0.4.0, VectorSchemaRoot.getFieldVectors() returns\nvectors in the file\u0027s on-disk order rather than in the order of the\nprojection requested via LanceFileReader.readAll(). Wrapping vectors\npositionally therefore mismatches the UnsafeProjection built from the\nrequested schema, causing UnsafeProjection to call type accessors on\nthe wrong column (e.g. getInt on a VarCharVector) and fail with\nUnsupportedOperationException for MoR reads where the FileGroupRecord\nBuffer rearranges columns relative to the file\u0027s write order.\n\nFix by looking up each vector by field name from the requested schema\nso the ColumnVector[] order matches what UnsafeProjection expects.\n\nCo-Authored-By: Claude Opus 4.6 \u003cnoreply@anthropic.com\u003e\n\n* [MINOR] Tighten comments on LanceRecordIterator and HoodieShowIndexes\n\nShorten the prose blocks above the column-order remapping in\nLanceRecordIterator and above HoodieShowIndexes to 2-3 sentences each,\nkeeping the why (lance-spark 0.4.0 on-disk column order; FQCN shadow\nfrom lance-spark-base) without the full incident narrative.\n\nCo-Authored-By: Claude Opus 4.6 \u003cnoreply@anthropic.com\u003e\n\n---------\n\nCo-authored-by: Claude Opus 4.6 \u003cnoreply@anthropic.com\u003e"
    },
    {
      "commit": "a83473689eaee9b63211bd07bcbfb87b50304c08",
      "tree": "3769ae3b911f47744f00c62bf84583fb7a6f749d",
      "parents": [
        "76a0a27e39be72801f5f003f83c1a56a77baa616"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Tue Apr 21 10:41:49 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon Apr 20 19:41:49 2026 -0700"
      },
      "message": "feat(vector): Add Spark SQL DDL CREATE TABLE support for VECTOR type (#18488)\n\n* feat(vector): Add Spark SQL DDL CREATE TABLE support for VECTOR type\n\n- Enable VECTOR(dim[, elementType]) syntax in Spark SQL DDL so users can create tables with vector columns directly via SQL instead of only through the DataFrame API.\n\n- Changes:\n  - Extend ANTLR grammar to accept identifier params in primitiveDataType\n  - Add VECTOR case in visitPrimitiveDataType (supports FLOAT, DOUBLE, INT8)\n  - Add VECTOR metadata attachment in addMetadataForType\n  - Add DDL tests for VECTOR columns in TestCreateTable\n\n* Fix tests and address comments\n\n* Improve VECTOR DDL test coverage with targeted tests and routing\n\n- Relax isHoodieCommand VECTOR check from \" vector(\" to \" vector\" in all 4 extended parser files.\n- The stricter \" vector(\" variant only routes SQL containing VECTOR type declarations with parentheses (e.g. VECTOR(128)), which means VECTOR without parens is delegated to Spark\u0027s native parser and never reaches our Hudi code path.\n- Relaxing to \" vector\" routes all VECTOR-related SQL through our parser, enabling us to exercise the \"vector with empty params\" branch of the `case (\"vector\", _ :: _)` pattern - previously reported as partial coverage because the empty-list side of the `_ :: _` check was never hit.\n- This is also consistent with the existing BLOB routing pattern \" blob\".\n\nAdd two targeted tests:\n1. test create table with INT8 VECTOR column - isolated INT8 test that independently exercises the `case INT8 \u003d\u003e ByteType` branch\n2. test create table with VECTOR without dimension fails - routes VECTOR alone through the Hudi parser to cover the empty-list branch"
    },
    {
      "commit": "76a0a27e39be72801f5f003f83c1a56a77baa616",
      "tree": "d723d0c05ac3337d40426ee406c4df761463493c",
      "parents": [
        "adf29acc2edf4d30ce55366b5c49ad28bc5c6a31"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Tue Apr 21 10:29:57 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Tue Apr 21 10:29:57 2026 +0800"
      },
      "message": "perf(common): Avoid double-iterating log files in file-system-view filters (#18531)\n\n* perf(common): Avoid double-iterating log files in file-system-view filters\n\n- filterUncommittedFiles and filterUncommittedLogs each called fileSlice.getLogFiles() twice: once to filter+collect into committedLogFiles, and again for .count() to compare sizes.\n- Each call produced a fresh stream over the underlying collection.\n- Materialize the log files once and compare against the resulting list size.\n- Also switch fileSlices.size() \u003d\u003d 0 to isEmpty() in fetchAllLogsMergedFileSlice.\n- No behaviour changes.\n\n* perf(common): add FileSlice#getLogFileCnt and drop intermediate list\n\n- Address review comment on #18531:\n  - The backing TreeSet\u0027s size() is O(1), so expose it as FileSlice#getLogFileCnt and use it for the size comparison in filterUncommittedFiles / filterUncommittedLogs.\n- This removes the allLogFiles materialization and keeps a single stream pass for the filter."
    },
    {
      "commit": "adf29acc2edf4d30ce55366b5c49ad28bc5c6a31",
      "tree": "f7f650f1074880d649fd37772d23faa622812631",
      "parents": [
        "f35b69cf028ec50dd13183e4272d1e28f16356de"
      ],
      "author": {
        "name": "mailtoboggavarapu-coder",
        "email": "mailtoboggavarapu@gmail.com",
        "time": "Mon Apr 20 22:21:04 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Tue Apr 21 10:21:04 2026 +0800"
      },
      "message": "fix: HoodieStorage resource leak in FileSystemBasedLockProvider.close() (#18461)\n\n* fix: close HoodieStorage resource in FileSystemBasedLockProvider.close()\n\nAdds a finally block to properly close the HoodieStorage instance\nafter the lock file is deleted, preventing resource leaks.\n\nFixes #14922\n\n* fix: simplify log message per review feedback\n\nAddresses nit from @yihua: shorten log to \"Failed to close HoodieStorage\" since logger already includes class context.\n\n* fix: add comment explaining HoodieStorage.close() is no-op for HadoopStorage\n\nAddresses review feedback from danny0405: HoodieHadoopStorage.close() is\na no-op because Hadoop FileSystem instances are shared within the JVM\nprocess lifecycle. Added a comment to explain this behavior and why the\ncall is still retained for interface contract correctness."
    },
    {
      "commit": "f35b69cf028ec50dd13183e4272d1e28f16356de",
      "tree": "7c1dc45b2e5b976dd9a4daf58ad354a713011434",
      "parents": [
        "0fb445496007ef1103077de0f2372706b0e90f10"
      ],
      "author": {
        "name": "Tim Brown",
        "email": "tim@generalintuition.com",
        "time": "Mon Apr 20 20:48:25 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon Apr 20 17:48:25 2026 -0700"
      },
      "message": "feat(blob): Read Blobs in Spark SQL (#18098)"
    },
    {
      "commit": "0fb445496007ef1103077de0f2372706b0e90f10",
      "tree": "f4d415a249a8a4ded8a7cd4e649a6e17878ba7ff",
      "parents": [
        "3a387da0cc7d675a6544c27e526456a2dadf5b93"
      ],
      "author": {
        "name": "Surya Prasanna",
        "email": "syalla@uber.com",
        "time": "Mon Apr 20 09:52:12 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon Apr 20 09:52:12 2026 -0700"
      },
      "message": "feat(utilities): add external HudiHiveSyncJob for on-demand Hive sync (#18204)\n\n- Added HudiHiveSyncJob under hudi-utilities as an external runner for Hive sync.\n- Added CLI/config support for base path, base file format, props file, and override configs.\n- Wired the job to build sync properties and invoke HiveSyncTool directly.\n\n---------\n\nCo-authored-by: sivabalan \u003cn.siva.b@gmail.com\u003e"
    },
    {
      "commit": "3a387da0cc7d675a6544c27e526456a2dadf5b93",
      "tree": "d6ad73789104db2a51dc1635dc3e5aba583423f8",
      "parents": [
        "95199f0da1d1dd7258475011980f369d7de07a13"
      ],
      "author": {
        "name": "Prashant Wason",
        "email": "pwason@uber.com",
        "time": "Mon Apr 20 08:09:07 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon Apr 20 08:09:07 2026 -0700"
      },
      "message": "feat(flink): Implement continuous sorting feature for append write (#18083)\n\n* Implemented a continuous sorting mode for the append sink to maintain sorted order incrementally and avoid single-partition lag during ingestion by reducing large pause time from sort and backpressure\n\nSummary:\n- Added AppendWriteFunctionWithContinuousSort which keeps records in a TreeMap keyed by a code-generated normalized key and an insertion sequence, drains oldest entries when a configurable threshold is reached, and writes drained records immediately; snapshot/endInput drain remaining records.\n- Updated AppendWriteFunctions.create to instantiate the continuous sorter when WRITE_BUFFER_SORT_CONTINUOUS_ENABLED is true.\n- Introduced three new FlinkOptions: WRITE_BUFFER_SORT_CONTINUOUS_ENABLED, WRITE_BUFFER_SORT_CONTINUOUS_DRAIN_THRESHOLD_PERCENT, and WRITE_BUFFER_SORT_CONTINUOUS_DRAIN_SIZE, and added runtime validation (buffer \u003e 0, 0 \u003c threshold \u003c 100, drainSize \u003e 0, parsed non-empty sort keys).\n- Added ITTestAppendWriteFunctionWithContinuousSort integration tests covering buffer flush triggers, sorted output correctness (with and without continuous drain), drain threshold/size behaviors, and invalid-parameter error cases.\n\n---------\n\nCo-authored-by: dsaisharath \u003cdsaisharath@uber.com\u003e\nCo-authored-by: Claude Opus 4.6 \u003cnoreply@anthropic.com\u003e"
    },
    {
      "commit": "95199f0da1d1dd7258475011980f369d7de07a13",
      "tree": "7fe048220655093ed08f601904457b75bd33f768",
      "parents": [
        "91dba3e16dec11cef8fb5d7e7285916569f38681"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Mon Apr 20 18:14:16 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon Apr 20 18:14:16 2026 +0800"
      },
      "message": "perf(common): avoid stream allocation in CollectionUtils.createImmutableList (#18530)\n\n- Replace Stream.of(elements).collect(Collectors.toList()) with a direct Arrays.asList copy.\n- Same semantics (mutable snapshot wrapped in an unmodifiable view), one less stream/spliterator/collector allocation per call.\n- Called from constants/initializers, so the saving is per-call but free."
    },
    {
      "commit": "91dba3e16dec11cef8fb5d7e7285916569f38681",
      "tree": "0b311e81657a73940c6985ecc215190cb89ace9e",
      "parents": [
        "cfb98336cf140aec4a573476e79974fefd7061a7"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Mon Apr 20 10:37:57 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon Apr 20 10:37:57 2026 +0800"
      },
      "message": "chore(common): Consolidate MapUtils into CollectionUtils (#18529)\n\n- MapUtils.isNullOrEmpty(Map) was byte-identical to CollectionUtils.isNullOrEmpty(Map), and MapUtils.nonEmpty(Map) had no matching overload in CollectionUtils.\n- Move the sole remaining unique helper (containsAll) into CollectionUtils, delete MapUtils, and fold its test into TestCollectionUtils so there is one canonical utility class for Map/Collection emptiness and containment checks."
    },
    {
      "commit": "cfb98336cf140aec4a573476e79974fefd7061a7",
      "tree": "fe56ca31782840138678679d818d35f7919557fe",
      "parents": [
        "937a64a2f4c5d96a5c6760e5646ed3017debeec2"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Mon Apr 20 10:13:16 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon Apr 20 10:13:16 2026 +0800"
      },
      "message": "chore(docker): Remove duplicate yarn.nodemanager.bind-host in entrypoint.sh (#18527)\n\n- The property was inserted into yarn-site.xml twice with the same value in the MULTIHOMED_NETWORK\u003d1 block.\n- Duplicates are harmless at runtime (Hadoop\u0027s Configuration parser takes the last value for duplicates and both writes use the same value), but the second write is dead code.\n- Applies to base, base_java11, and base_java17."
    },
    {
      "commit": "937a64a2f4c5d96a5c6760e5646ed3017debeec2",
      "tree": "d0918864ef5abc0eb6280303013c8f8b2a767821",
      "parents": [
        "55bf91a451bb09fe730b84ded15b44dce9d3c23b"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Mon Apr 20 09:54:48 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon Apr 20 09:54:48 2026 +0800"
      },
      "message": "fix: whitelist Flink _2.12 artifacts in scala-2.13 enforcer rule (#18508)\n\n* fix: whitelist Flink _2.12 artifacts in scala-2.13 enforcer rule\n\n- Flink only publishes flink-table-planner_2.12 (no _2.13 variant exists). The scala-2.13 profile\u0027s blanket ban on *_2.12 artifacts breaks the build when combining -Dspark4.0 -Dscala-2.13 with -Dflink1.20.\n- Add \u003cincludes\u003e exceptions for org.apache.flink:*_2.12 and its transitive _2.12 dependencies (org.scala-lang.modules, com.twitter) since Flink has largely decoupled from Scala since 1.15 and the _2.12 suffix is internal.\n\n* fix: tighten Flink _2.12 whitelist and add rationale comment\n\nAddress review feedback on #18508:\n- Narrow scala-lang.modules and com.twitter includes to the specific transitives (scala-xml_2.12, chill_2.12) so the enforcer still catches unexpected _2.12 leakage from those groups.\n- Add an XML comment explaining why these _2.12 artifacts are whitelisted despite the blanket scala-2.13 ban (Flink\u0027s _2.12 suffix is a legacy naming artifact post 1.15 Scala decoupling)."
    },
    {
      "commit": "55bf91a451bb09fe730b84ded15b44dce9d3c23b",
      "tree": "a389952613d7c97423defadc213cb00fcf5cb151",
      "parents": [
        "eaaae8a4f9abf57f3b06da0420d37a084ebfa3e5"
      ],
      "author": {
        "name": "aaaZayne",
        "email": "1138069338@qq.com",
        "time": "Mon Apr 20 06:38:26 2026 +1000"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon Apr 20 04:38:26 2026 +0800"
      },
      "message": "introduce static helper method to remove clones (#18533)"
    },
    {
      "commit": "eaaae8a4f9abf57f3b06da0420d37a084ebfa3e5",
      "tree": "22aa503f1a5a0bb4e1e44fff9a578f7b016c0fed",
      "parents": [
        "3d0ab800de5ea2e476e014d93bc4f70a247d2baf"
      ],
      "author": {
        "name": "chrevanthreddy",
        "email": "27821245+chrevanthreddy@users.noreply.github.com",
        "time": "Sat Apr 18 13:04:57 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Sat Apr 18 10:04:57 2026 -0700"
      },
      "message": "feat: Add Azure-based storage lock (#17951)\n\nCo-authored-by: Revanth Chandupatla \u003crevanth.chandupatla@walmart.com\u003e\nCo-authored-by: Y Ethan Guo \u003cethan.guoyihua@gmail.com\u003e"
    },
    {
      "commit": "3d0ab800de5ea2e476e014d93bc4f70a247d2baf",
      "tree": "7c42c4e47319e5f721d9c38f39fa6724cb33fb3d",
      "parents": [
        "9ddf58223922d0dd286329306f95007f8a514f60"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Sat Apr 18 04:56:07 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri Apr 17 13:56:07 2026 -0700"
      },
      "message": "chore(docker): bump integ-test docker-compose to Hive 2.3.10 (#18525)"
    },
    {
      "commit": "9ddf58223922d0dd286329306f95007f8a514f60",
      "tree": "e110c1c367aee9766096b1f36b8ba14b17e42593",
      "parents": [
        "97f9628e038e78d9155fc34df32df6d322f351cf"
      ],
      "author": {
        "name": "Surya Prasanna",
        "email": "syalla@uber.com",
        "time": "Fri Apr 17 12:48:13 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri Apr 17 12:48:13 2026 -0700"
      },
      "message": "chore: add timing logs for file index partition and file listing (#18417)\n\nThis PR improves observability around file index partition and file listing paths in BaseHoodieTableFileIndex by adding timing logs and cache-miss diagnostics. It also includes a small optimization to return early when all requested partition paths are already present in the file-status cache, avoiding the metadata/file listing path in that case.\n\n\n---------\n\nCo-authored-by: sivabalan \u003cn.siva.b@gmail.com\u003e"
    },
    {
      "commit": "97f9628e038e78d9155fc34df32df6d322f351cf",
      "tree": "e756faac06bc401b9d52685b088283918592439e",
      "parents": [
        "7bcb8be7c153010998574f591c7ff3799192eda6"
      ],
      "author": {
        "name": "Y Ethan Guo",
        "email": "ethan.guoyihua@gmail.com",
        "time": "Fri Apr 17 12:40:04 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri Apr 17 12:40:04 2026 -0700"
      },
      "message": "feat(docker): add --multi-arch flag for cross-platform image builds (#18522)"
    },
    {
      "commit": "7bcb8be7c153010998574f591c7ff3799192eda6",
      "tree": "5b5b8ebd0dc084da2c0c4a6a051e82a7367b0416",
      "parents": [
        "41cfc191f71be9da6d3fd22fc931699048e6a142"
      ],
      "author": {
        "name": "Krishen",
        "email": "22875197+kbuci@users.noreply.github.com",
        "time": "Fri Apr 17 11:26:54 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri Apr 17 11:26:54 2026 -0700"
      },
      "message": "feat(metadata): Allow users to safely execute compaction plans on metadata table concurrently through a table service platform (rather than only inline during write) (#18295)\n\nEnable support for executing metadata table (MDT) compaction/logcompaction plans from a concurrent writer that operates independently of the primary write path. Additionally, allow users to configure a table service manager to skip inline execution of compaction/logcompaction on the metadata table, so that these operations can be handled by a dedicated async \"table service platform\".\n\nSummary and Changelog\nSummary: Adds configuration-driven support for multi-writer concurrency on the metadata table and table service manager delegation of MDT compaction/logcompaction.\n\nChangelog:\n\nAdded hoodie.metadata.write.concurrency.mode config to HoodieMetadataConfig to control the write concurrency mode for the metadata table. When set to OPTIMISTIC_CONCURRENCY_CONTROL, the MDT write config inherits the lock configuration from the data table, enabling a concurrent writer to execute table service plans on the MDT.\nAdded hoodie.metadata.table.service.manager.enabled and hoodie.metadata.table.service.manager.actions configs to HoodieMetadataConfig, allowing users to delegate specific table service actions (compaction, logcompaction) on the metadata table to an external table service manager.\nWhen table service manager is enabled for an action, scheduling of compaction/logcompaction plans still proceeds normally, but inline execution is skipped — leaving the plans on the timeline for the table service manager or a concurrent writer to pick up.\nSimilarly, pending compaction/logcompaction plans from previous attempts are not executed inline when their action is delegated to the table service manager.\nApplied the same changes to both HoodieBackedTableMetadataWriter (table version 8+) and HoodieBackedTableMetadataWriterTableVersionSix.\nAdded a helper method to carefully extract lock-related properties from the data table\u0027s write config without overwriting other MDT-specific settings (e.g., base path).\n\nImpact\nNew configs: hoodie.metadata.write.concurrency.mode, hoodie.metadata.table.service.manager.enabled, hoodie.metadata.table.service.manager.actions — all marked as advanced with safe defaults (single writer, TSM disabled).\nNo changes to existing behavior when using default configuration.\nUsers who enable these configs can run MDT compaction/logcompaction from a separate pipeline without conflicting with the primary writer.\n\n\n\n---------\n\nCo-authored-by: Krishen Bhan \u003c“bkrishen@uber.com”\u003e"
    },
    {
      "commit": "41cfc191f71be9da6d3fd22fc931699048e6a142",
      "tree": "80b3c1e2417bc87c414b3123e152a3c8c1e4e44a",
      "parents": [
        "0356488034dc2a748c3eafaa6cb5f19f8db8daab"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Sat Apr 18 02:21:23 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri Apr 17 11:21:23 2026 -0700"
      },
      "message": "chore: Add Java 17 Hadoop base image and Spark 4.0.1 docker compose setup (#18520)"
    },
    {
      "commit": "0356488034dc2a748c3eafaa6cb5f19f8db8daab",
      "tree": "1d0360a86434d2000d24602aa7485d3009770932",
      "parents": [
        "a3697738541ad68ad35a2b6cfb7fe71cdcabbc34"
      ],
      "author": {
        "name": "Y Ethan Guo",
        "email": "ethan.guoyihua@gmail.com",
        "time": "Fri Apr 17 10:52:21 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri Apr 17 10:52:21 2026 -0700"
      },
      "message": "fix(docker): fix docker image build with Java 11 and Hive 2.3.10 (#18519)"
    },
    {
      "commit": "a3697738541ad68ad35a2b6cfb7fe71cdcabbc34",
      "tree": "95b4adc9fac69b8e64bb58529570c324a2ddb5f0",
      "parents": [
        "c1af4f5f9a96b050a84b4336f888211d1747ae4e"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Sat Apr 18 01:07:55 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri Apr 17 10:07:55 2026 -0700"
      },
      "message": "chore: cleanup docker-compose files (#17950)\n\n- Remove unused exposed debug ports\n- Standardise environment variable notation\n- Add docker-compose_hadoop284_hive2310_spark353_amd64\n- Add docker-compose_hadoop284_hive2310_spark353_arm64\n- Updated outdated repos from bitnami/* -\u003e bitnamilegacy/*"
    },
    {
      "commit": "c1af4f5f9a96b050a84b4336f888211d1747ae4e",
      "tree": "9d080f76af79de13138fa6e0d4c1625f89e8ee76",
      "parents": [
        "2b33f5d407d6371fbe74dc3e304cc92ef45961c4"
      ],
      "author": {
        "name": "mailtoboggavarapu-coder",
        "email": "mailtoboggavarapu@gmail.com",
        "time": "Thu Apr 16 20:57:27 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri Apr 17 08:57:27 2026 +0800"
      },
      "message": "fix: use forward slash literal and remove unused import in DFSPropertiesConfiguration (#18454)\n\nReplaces File.separator with \"/\" (HDFS paths always use forward slash)\nand removes the now-unused import java.io.File.\n\nFixes #14922"
    },
    {
      "commit": "2b33f5d407d6371fbe74dc3e304cc92ef45961c4",
      "tree": "f91cc566f9913dd79d85e83ecd04ff91fcd9706d",
      "parents": [
        "a649188b6460d5a282d9675107ecbd439e401147"
      ],
      "author": {
        "name": "mailtoboggavarapu-coder",
        "email": "mailtoboggavarapu@gmail.com",
        "time": "Thu Apr 16 20:47:38 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri Apr 17 08:47:38 2026 +0800"
      },
      "message": "fix: fix BufferedReader resource leak in FileIOUtils.readAsUTFStringLines (#18470)"
    },
    {
      "commit": "a649188b6460d5a282d9675107ecbd439e401147",
      "tree": "9ccd20b8bcd718622ffd769cb88e513b5bb130de",
      "parents": [
        "f144abc5fce028d12c621bbea09232091474ae46"
      ],
      "author": {
        "name": "Surya Prasanna",
        "email": "syalla@uber.com",
        "time": "Thu Apr 16 09:49:41 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 16 09:49:41 2026 -0700"
      },
      "message": "feat(spark): refresh parquet tools clustering strategy for current master (#18409)\n\nThis PR refreshes the parquet-tools based clustering strategy from the older parquet-tools branch so it can be proposed against current apache/master.\n\nThe original implementation had drifted from current Hudi internals and test APIs. This refresh keeps the existing simple rewrite hook shape while aligning the implementation with current clustering and storage behavior.\n\nSummary and Changelog\nRefresh the parquet-tools clustering strategy and its supporting tests for current master.\n\nkeep the ParquetToolsExecutionStrategy API simple with the existing file-to-file rewrite hook\ngenerate a new output file id for clustering rewrites instead of reusing the source file id\nmigrate helper code to current StoragePath / HoodieStorage based APIs\nreplace brittle previous-commit extraction with FSUtils.getCommitTime(...)\nupdate write-status generation to use current parquet/storage utilities\nrefresh the related tests to match current writer, meta client, and clustering strategy APIs\n\n---------\n\nCo-authored-by: sivabalan \u003cn.siva.b@gmail.com\u003e"
    },
    {
      "commit": "f144abc5fce028d12c621bbea09232091474ae46",
      "tree": "b440634396d574c8143fdc8e939353944324d5dd",
      "parents": [
        "5b6860713e603aca273c98e4cb99ef02f940996b"
      ],
      "author": {
        "name": "Surya Prasanna",
        "email": "syalla@uber.com",
        "time": "Thu Apr 16 09:49:03 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 16 09:49:03 2026 -0700"
      },
      "message": "feat: Include ReverseOrderHoodieRecordPayload (#17928)\n\nThis PR addresses the need for reverse-order payload merging in Hudi, where the oldest record (based on ordering field) should be preserved instead of the latest. It also adds configurability to control behavior when ordering values are equal, and optimizes the default payload to avoid unnecessary rewrites.\n\nWhat users gain:\n\nNew ReverseOrderHoodieRecordPayload class for use cases requiring oldest-record-wins semantics\nConfiguration option to control update behavior when ordering field values are equal\nPerformance improvement by avoiding unnecessary record rewrites when incoming records are older"
    },
    {
      "commit": "5b6860713e603aca273c98e4cb99ef02f940996b",
      "tree": "29235e9f25ac4af3d3c64e8f445687ccd9bfeea4",
      "parents": [
        "cad1530b03ca35728eff2dc297500d978cac04d5"
      ],
      "author": {
        "name": "mailtoboggavarapu-coder",
        "email": "mailtoboggavarapu@gmail.com",
        "time": "Wed Apr 15 23:37:12 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 16 11:37:12 2026 +0800"
      },
      "message": "fix: fix Scanner file handle leak in HiveIncrementalPuller.executeIncrementalSQL (#18457)"
    },
    {
      "commit": "cad1530b03ca35728eff2dc297500d978cac04d5",
      "tree": "be8be8234151a76abe931fb437cf57eded5fe70b",
      "parents": [
        "5066fcc46e54cb8a1df13da31b89a5dd4cab367c"
      ],
      "author": {
        "name": "mailtoboggavarapu-coder",
        "email": "mailtoboggavarapu@gmail.com",
        "time": "Wed Apr 15 23:35:55 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 16 11:35:55 2026 +0800"
      },
      "message": "fix: Scanner resource leak in SqlFileBasedSource.fetchNextBatch (#18467)\n\n* Fix Scanner resource leak in SqlFileBasedSource.fetchNextBatch"
    },
    {
      "commit": "5066fcc46e54cb8a1df13da31b89a5dd4cab367c",
      "tree": "2fabf1844636f71678997dd1b94599c1e99567e0",
      "parents": [
        "d3e020132a9151a34b54534c717892c43ebb0eac"
      ],
      "author": {
        "name": "Sivabalan Narayanan",
        "email": "n.siva.b@gmail.com",
        "time": "Wed Apr 15 19:27:47 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed Apr 15 19:27:47 2026 -0700"
      },
      "message": "feat: Adding rolling extra metadata support  (#18421)\n\nMotivation: In streaming ingestion pipelines, checkpoint information is critical for exactly-once processing and failure recovery. Currently, users must manually track and pass this metadata with every commit, or walk back the timeline to find the latest checkpoint. This PR automates this process by allowing users to configure specific metadata keys that are automatically carried forward to every subsequent commit.\n\nSummary:\nUsers can now configure specific extra metadata keys (e.g., checkpoint.offset, checkpoint.partition) that Hudi will automatically carry forward across all commits. When a new commit is created, Hudi checks recent commits for these configured keys and merges them into the current commit\u0027s metadata. This ensures checkpoint information is always available in the latest commit without manual intervention.\n\nDetailed Changelog:\n\nConfiguration Changes:\n\nAdded hoodie.write.rolling.metadata.keys (advanced, default: empty) - Comma-separated list of metadata keys to automatically roll forward\nAdded hoodie.write.rolling.metadata.timeline.lookback.commits (advanced, default: 10) - Maximum number of recent commits to search when looking for missing rolling metadata keys\nImplementation Changes:\n\nAdded BaseHoodieWriteClient.mergeRollingMetadata() method that:\nExecutes within the transaction lock after write conflict resolution in preCommit()\nWalks back timeline in reverse order (most recent first) to find latest values for configured keys\nMerges found values into current commit\u0027s extra metadata\nUses fresh timeline view (either from createTable() or reloaded during conflict resolution)\nSkips metadata table (applies only to data tables)\nStops early once all keys are found (performance optimization)\nErrors in rolling metadata merge do not fail the commit (non-blocking)\nAdded getter methods getRollingMetadataKeys() and getRollingMetadataTimelineLookbackCommits() in HoodieWriteConfig\nAdded builder methods withRollingMetadataKeys() and withRollingMetadataTimelineLookbackCommits() in HoodieWriteConfig.Builder"
    },
    {
      "commit": "d3e020132a9151a34b54534c717892c43ebb0eac",
      "tree": "92db895a58fd051442090912350e9d6f514f7ba4",
      "parents": [
        "12b3a06c2cc6756333e1302246a8999180fc21f4"
      ],
      "author": {
        "name": "Krishen",
        "email": "22875197+kbuci@users.noreply.github.com",
        "time": "Wed Apr 15 15:30:29 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed Apr 15 15:30:29 2026 -0700"
      },
      "message": "fix(common): FutureUtils:allOf should always throw root cause exception (#18456)\n\nFutureUtils.allOf() has a race condition that causes the original root-cause exception to be silently replaced by a CancellationException, making it impossible to diagnose failures in any code path that uses it — most notably MultipleSparkJobExecutionStrategy.performClustering(), which executes clustering groups in parallel using FutureUtils.allOf(). Fixing the same in this patch\n\n\n\n\n---------\n\nCo-authored-by: Krishen Bhan \u003c“bkrishen@uber.com”\u003e"
    },
    {
      "commit": "12b3a06c2cc6756333e1302246a8999180fc21f4",
      "tree": "7b7d3a5bbefebfec2f5620a5d29016b27bc20a8e",
      "parents": [
        "98f90c14bb953958df8cc0ca53f2ff10e95f29d2"
      ],
      "author": {
        "name": "Sivabalan Narayanan",
        "email": "n.siva.b@gmail.com",
        "time": "Wed Apr 15 10:52:18 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed Apr 15 10:52:18 2026 -0700"
      },
      "message": "feat: Support to cap max commits to clean in one round of clean commit (#18322)\n\nWhen the cleaner is disabled for an extended period and the timeline grows large (e.g., 1000+ commits), resuming the cleaner can attempt to clean hundreds or thousands of commits worth of file slices in a single operation. This can cause memory pressure during clean planning, long-running operations that may timeout, OOM errors, and operational instability.\n\nThis PR introduces a new configuration hoodie.clean.max.commits.to.clean to cap the maximum number of commits that can be cleaned in a single clean operation, allowing for gradual, incremental cleanup over multiple clean runs.\n\nSummary:\nUsers can now configure hoodie.clean.max.commits.to.clean to limit the number of commits cleaned per operation. This prevents resource exhaustion when resuming cleaning after a long period of inactivity. The cleaner will incrementally catch up over multiple runs, cleaning up to the configured limit each time.\n\nChangelog:\n\nfeat(clean): Add hoodie.clean.max.commits.to.clean configuration to cap commits cleaned per operation\nAdded MAX_COMMITS_TO_CLEAN config in HoodieCleanConfig with default value Long.MAX_VALUE\nAdded getMaxCommitsToClean() accessor in HoodieWriteConfig\nAdded withMaxCommitsToClean() builder method in HoodieCleanConfig.Builder\ncore: Updated CleanerUtils.getEarliestCommitToRetain() to support capping\nExtended method signature to accept previousEarliestCommitToRetain and maxCommitsToClean parameters\nAdded capCommitsToClean() helper method to adjust earliest commit when cap is exceeded\nLogs when capping is applied with before/after commit counts\ncore: Updated CleanPlanner.getEarliestCommitToRetain() to retrieve previous clean metadata\nReads earliestCommitToRetain from last completed clean\u0027s metadata\nPasses previous clean info and config to CleanerUtils.getEarliestCommitToRetain()\nGracefully handles missing previous clean metadata (no capping applied)\ncore: Updated ArchivalUtils.getEarliestCommitToRetain() to pass empty values for new parameters\nArchival continues to work without capping (uses Option.empty() and Long.MAX_VALUE)\ntest: Added comprehensive unit tests in TestCleanerUtils\nTests for KEEP_LATEST_COMMITS policy with/without capping\nTests for KEEP_LATEST_BY_HOURS policy with/without capping\nTests for boundary conditions, missing previous clean, and default values\nAdded helper methods to create mock timelines with realistic timestamps"
    },
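A hedged sketch of the capping rule this config introduces. The config key, its Long.MAX_VALUE default, and the capCommitsToClean() name come from the message above, but the method shape and String-typed instants below are illustrative simplifications, not the actual CleanerUtils signature.

```java
import java.util.List;

// Illustrative sketch of the rule behind hoodie.clean.max.commits.to.clean;
// the real logic lives in CleanerUtils#capCommitsToClean.
final class CleanCapSketch {
  /**
   * @param pendingInstants   instants eligible for cleaning in this run, oldest
   *                          first (everything between the previous clean's
   *                          earliestCommitToRetain and the newly proposed one)
   * @param proposedEarliest  the earliest commit the planner wants to retain
   * @param maxCommitsToClean cap from hoodie.clean.max.commits.to.clean
   *                          (default Long.MAX_VALUE, i.e. no capping)
   * @return the possibly-capped earliest commit to retain for this clean run
   */
  static String capEarliestCommitToRetain(List<String> pendingInstants,
                                          String proposedEarliest,
                                          long maxCommitsToClean) {
    if (pendingInstants.size() <= maxCommitsToClean) {
      return proposedEarliest; // under the cap: clean everything pending in one run
    }
    // Over the cap: only advance past maxCommitsToClean instants; the cleaner
    // catches up incrementally over subsequent runs.
    return pendingInstants.get((int) maxCommitsToClean);
  }
}
```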
    {
      "commit": "98f90c14bb953958df8cc0ca53f2ff10e95f29d2",
      "tree": "72b0d66ca59ef912d43b468803e586f67c5aa72b",
      "parents": [
        "4b15e50e8c95046cfadd24c612e92b6a73df15a6"
      ],
      "author": {
        "name": "Surya Prasanna",
        "email": "syalla@uber.com",
        "time": "Wed Apr 15 10:41:23 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed Apr 15 10:41:23 2026 -0700"
      },
      "message": "fix(payload): support sentinel no-op updates in DefaultHoodieRecordPayload (#18413)\n\nSummary: \nThis PR updates DefaultHoodieRecordPayload to avoid rewriting records when the incoming payload should not win the merge based on ordering semantics. Instead of returning the current persisted record, it can return a sentinel no-op result so the existing record is preserved without an unnecessary rewrite. The change also makes equal-ordering behavior configurable.\n\nChanges:\n\nreturn SENTINEL from combineAndGetUpdateValue() when the incoming record should not update the persisted record\nadd canProduceSentinel() support in DefaultHoodieRecordPayload\nadd hoodie.payload.update.on.same.ordering.field config with default true\nadd tests covering equal-ordering, older incoming record, newer incoming record, and default behavior\n\n\n---------\n\nCo-authored-by: sivabalan \u003cn.siva.b@gmail.com\u003e"
    },
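A simplified, self-contained sketch of the sentinel decision described above. Hudi's real combineAndGetUpdateValue() operates on Avro records and returns an Option, so treat the long ordering values and Object return below as illustrative only.

```java
// Sketch of the sentinel no-op merge decision, not Hudi's actual payload API.
final class SentinelMergeSketch {
  static final Object SENTINEL = new Object(); // "keep the persisted record, skip the rewrite"

  static Object combineAndGetUpdateValue(long persistedOrderingVal,
                                         long incomingOrderingVal,
                                         Object incomingRecord,
                                         boolean updateOnSameOrderingField) {
    // updateOnSameOrderingField mirrors hoodie.payload.update.on.same.ordering.field
    // (default true: an equal ordering value lets the incoming record win).
    boolean incomingLoses = incomingOrderingVal < persistedOrderingVal
        || (incomingOrderingVal == persistedOrderingVal && !updateOnSameOrderingField);
    if (incomingLoses) {
      return SENTINEL; // no-op: persisted record is preserved without a rewrite
    }
    return incomingRecord; // incoming record wins and becomes the update
  }
}
```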
    {
      "commit": "4b15e50e8c95046cfadd24c612e92b6a73df15a6",
      "tree": "427e94dc4e1e183964188ea06f8713064d6c5437",
      "parents": [
        "8cd264825dcc720ce1ec02bba6eba7758e451824"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Wed Apr 15 18:03:08 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed Apr 15 18:03:08 2026 +0800"
      },
      "message": "feat(sync): Map VARIANT type to struct in Hive, Spark, and BigQuery sync (#18483)\n\n- VARIANT columns were not handled in HiveSchemaUtil, SparkSchemaUtils, and BigQuerySchemaResolver, causing UnsupportedOperationException when syncing tables with variant columns.\n- Map VARIANT to its underlying physical type (struct) so external engines can read via metastore.\n- Add tests for nullable and nested VARIANT"
    },
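The same mapping pattern recurs in the BLOB and VECTOR sync fixes further down: instead of throwing UnsupportedOperationException, fall through to the logical type's underlying physical type. A hypothetical sketch of that shared shape; the Kind enum and method are invented for illustration, not the actual HiveSchemaUtil / BigQuerySchemaResolver code.

```java
// Hypothetical sketch of the logical-to-physical type mapping in sync.
final class SyncTypeMappingSketch {
  enum Kind { VARIANT, BLOB, VECTOR, STRING }

  static String toHiveType(Kind kind) {
    switch (kind) {
      case VARIANT: // previously threw UnsupportedOperationException
      case BLOB:    // physically a struct; real code emits the struct's Hive DDL
        return "struct<...>";
      case VECTOR:  // physically binary/BYTES, readable by engines like Trino
        return "binary";
      default:
        return "string";
    }
  }
}
```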
    {
      "commit": "8cd264825dcc720ce1ec02bba6eba7758e451824",
      "tree": "c45d81ed0ee9769d48939a56a01c08d4cb5b2ae8",
      "parents": [
        "da2667c4c6e6b7cd34edf85c29fca271f14aa71d"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Wed Apr 15 17:04:25 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed Apr 15 17:04:25 2026 +0800"
      },
      "message": "chore: Allow versions to be specified in build_docker_images.sh (#17948)"
    },
    {
      "commit": "da2667c4c6e6b7cd34edf85c29fca271f14aa71d",
      "tree": "7e28fb5ac5d4e8a0f91fcc16baa55551b1221c73",
      "parents": [
        "0bdab84bc35fad3e6590651b3740815d95b27e5e"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Wed Apr 15 17:02:04 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed Apr 15 17:02:04 2026 +0800"
      },
      "message": "feat(sync): Map BLOB type to struct in Hive and BigQuery sync (#18482)\n\n* feat(sync): Map BLOB type to struct in Hive and BigQuery sync\n\n- BLOB columns were not handled in HiveSchemaUtil and BigQuerySchemaResolver, causing UnsupportedOperationException when syncing tables with blob columns.\n- Map BLOB to its underlying physical type (struct) so external engines can read via Hive metastore and BigQuery.\n\n* Add test for nullable and nested blobs"
    },
    {
      "commit": "0bdab84bc35fad3e6590651b3740815d95b27e5e",
      "tree": "bcfefebf4b59aa56fc8873072e2ba5e5ae66afbe",
      "parents": [
        "613fc49955489331c92376aab4f7aed6eea45817"
      ],
      "author": {
        "name": "Shuo Cheng",
        "email": "njucshuo@gmail.com",
        "time": "Wed Apr 15 10:39:29 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed Apr 15 10:39:29 2026 +0800"
      },
      "message": "feat(flink): Add metrics for RocksDB index backend in bucket assigner (#18484)\n\n* feat(flink): Add metrics for RocksDB index backend in bucket assigner\n\n* fix comments"
    },
    {
      "commit": "613fc49955489331c92376aab4f7aed6eea45817",
      "tree": "1ca8cc89b071eb16c0d9f3e398ef0cdd8dd9faf1",
      "parents": [
        "fc7f30301e94690780f348e33af39041aafcf7ff"
      ],
      "author": {
        "name": "Surya Prasanna",
        "email": "syalla@uber.com",
        "time": "Mon Apr 13 15:22:54 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon Apr 13 15:22:54 2026 -0700"
      },
      "message": "feat(common): add log reader scan metrics and logging for log block processing (#18412)\n\nPorts the logging and monitoring improvements from the earlier AbstractHoodieLogRecordReader\nchanges into the current BaseHoodieLogRecordReader implementation.\n\nThe current reader was missing block-level scan visibility and downstream propagation of the\nresulting metrics, which made it harder to understand log scanning behavior during compaction\nand related read paths.\n\n---------\n\nCo-authored-by: sivabalan \u003cn.siva.b@gmail.com\u003e"
    },
    {
      "commit": "fc7f30301e94690780f348e33af39041aafcf7ff",
      "tree": "d0b87e3dcde6f69aac02246f8ed2d7947ecc44e7",
      "parents": [
        "00a406682edb0d803705a60756d7e3ef00a9c34b"
      ],
      "author": {
        "name": "dependabot[bot]",
        "email": "49699333+dependabot[bot]@users.noreply.github.com",
        "time": "Mon Apr 13 17:59:45 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon Apr 13 17:59:45 2026 +0800"
      },
      "message": "chore(deps): bump org.apache.logging.log4j:log4j-core (#18490)\n\nBumps org.apache.logging.log4j:log4j-core from 2.25.3 to 2.25.4.\n\n---\nupdated-dependencies:\n- dependency-name: org.apache.logging.log4j:log4j-core\n  dependency-version: 2.25.4\n  dependency-type: direct:production\n...\n\nSigned-off-by: dependabot[bot] \u003csupport@github.com\u003e\nCo-authored-by: dependabot[bot] \u003c49699333+dependabot[bot]@users.noreply.github.com\u003e"
    },
    {
      "commit": "00a406682edb0d803705a60756d7e3ef00a9c34b",
      "tree": "6d26f7383a8f1d5c6ffecefcf7984d7c00bdd709",
      "parents": [
        "88c146e8072bf185a5a636af91bb92def7cd67dd"
      ],
      "author": {
        "name": "voonhous",
        "email": "voonhousu@gmail.com",
        "time": "Sun Apr 12 23:18:56 2026 +0800"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Sun Apr 12 23:18:56 2026 +0800"
      },
      "message": "feat(sync): Map VECTOR type to binary for metastore sync support (#18480)\n\n* fix(sync): Map VECTOR type to binary in Hive, Spark, and BigQuery sync (#18343)\n\n- VECTOR columns were not handled in sync schema converters, causing UnsupportedOperationException when syncing tables with vector columns to external metastores.\n- Map VECTOR to its underlying physical type (binary/BYTES) so engines like Trino can read via Hive metastore.\n\n* Remove unsued test"
    }
  ],
  "next": "88c146e8072bf185a5a636af91bb92def7cd67dd"
}
