{
  "log": [
    {
      "commit": "638270effc4751a72fa0a999855ca226b63c39df",
      "tree": "7c1e811183bede2dae1297d8877697ee1a2a405e",
      "parents": [
        "90b9b44eb18228a580b5fbce9293e2d750f01d2f"
      ],
      "author": {
        "name": "Sanskar Modi",
        "email": "sanskarmodi97@gmail.com",
        "time": "Fri Jun 12 16:41:03 2026 +0800"
      },
      "committer": {
        "name": "Nicholas Jiang",
        "email": "programgeek@163.com",
        "time": "Fri Jun 12 16:41:03 2026 +0800"
      },
      "message": "[CELEBORN-2016] Fix the worker decommission and graceful shutdown condition\n\n### What changes were proposed in this pull request?\n\nIn the current code, the shutdown hook is given `timeout` to run, and the condition checks `timeSpent \u003c timeout`. By the time this condition stops holding, the shutdown hook timer has already expired and the VM exits without executing the code below this point.\n\nThe new condition is `timeSpent + interval \u003c timeout`, which leaves (0, interval] time to execute the code below.\n\n### Why are the changes needed?\n\nWith the current shutdown logic, we have seen workers shut down abruptly with a timeout exception without completely executing the shutdown hook. As a result, Celeborn is unable to print unreleased partition locations and unreleased shuffles on decommission and graceful shutdown.\n\n### Does this PR resolve a correctness bug?\n\n- [ ] Yes\n\n### Does this PR introduce _any_ user-facing change?\n\n- [ ] Yes\n\n### How was this patch tested?\n\nNA\n\nCloses #3727 from s0nskar/shutdown_fix.\n\nAuthored-by: Sanskar Modi \u003csanskarmodi97@gmail.com\u003e\nSigned-off-by: Nicholas Jiang \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "90b9b44eb18228a580b5fbce9293e2d750f01d2f",
      "tree": "332aaeb5383fb9681ae0a588cea7cdcdf3fa7b90",
      "parents": [
        "61499a0c9ac40786c0e846009f0ea77e1b1d42ed"
      ],
      "author": {
        "name": "Sanskar Modi",
        "email": "sanskarmodi97@gmail.com",
        "time": "Thu Jun 11 22:38:03 2026 +0800"
      },
      "committer": {
        "name": "Shuang",
        "email": "lvshuang.xjs@alibaba-inc.com",
        "time": "Thu Jun 11 22:38:03 2026 +0800"
      },
      "message": "[CELEBORN-2351] Partition file sorting should only be paused for PUSH_AND_REPLICATE_PAUSED\n\n### What changes were proposed in this pull request?\n\nPartition file sorting should only be paused for `PUSH_AND_REPLICATE_PAUSED`, which represents very high memory pressure that can cause OOM for workers. Sorting should be allowed in the `PUSH_PAUSED` state.\n\n### Why are the changes needed?\n\nCurrently we stop sorting partition files even in the push-paused state. If the pause is sustained for a long time, sorting can time out, and readers waiting for the sort will fail or be delayed.\n\n### Does this PR resolve a correctness bug?\n\n- [ ] Yes\n\n### Does this PR introduce _any_ user-facing change?\n\n- [X] Yes\n\n### How was this patch tested?\n\nExisting UTs\n\nCloses #3720 from s0nskar/sort_memory_ready.\n\nAuthored-by: Sanskar Modi \u003csanskarmodi97@gmail.com\u003e\nSigned-off-by: Shuang \u003clvshuang.xjs@alibaba-inc.com\u003e\n"
    },
    {
      "commit": "61499a0c9ac40786c0e846009f0ea77e1b1d42ed",
      "tree": "5a9037ff6d490900504e5a5e8db5a497ac383fd5",
      "parents": [
        "b1a7cb4964f55a05544334befb0e49f78001f872"
      ],
      "author": {
        "name": "夷羿",
        "email": "yiyi.zt@alibaba-inc.com",
        "time": "Thu Jun 11 21:55:41 2026 +0800"
      },
      "committer": {
        "name": "Shuang",
        "email": "lvshuang.xjs@alibaba-inc.com",
        "time": "Thu Jun 11 21:55:41 2026 +0800"
      },
      "message": "[CELEBORN-2319] Standalone LifecycleManager \u0026\u0026 rust sdk\n\n### What changes were proposed in this pull request?\n\nThis PR introduces two major features to support **non-JVM (C++/Rust) clients** using Apache Celeborn for shuffle:\n\n**1. Standalone LifecycleManager Daemon (Scala/JVM)**\n\n- Added `LifecycleManagerDaemon` — a standalone JVM process that hosts a `LifecycleManager` independently from any compute engine (Spark/Flink) Driver. It installs a shutdown hook (with a watchdog that force-halts if graceful stop exceeds the timeout) and blocks until SIGINT/SIGTERM.\n- Added `LifecycleManagerDaemonArguments` for CLI argument parsing (`--app-id`, `--master-endpoints`, `--port`/`-p`, `--host`, `--properties-file`, `-h`/`--help`). Parsing is a pure function that throws `ArgumentParseException` (carrying an exit code) so every branch is unit-testable; `parseOrExit` wraps it for the process entry point.\n- Added `sbin/start-lifecycle-manager.sh` launch script with classpath assembly, environment loading, required-argument validation, automatic free-port selection, and **RPC-port polling** to confirm the daemon is actually bound before reporting success.\n- Added a new **`lifecycle-manager` Maven/sbt module** that depends on `celeborn-service`, `celeborn-client` and `celeborn-common`. Registered the module in both the root `pom.xml` and `project/CelebornBuild.scala` (`projectDefinitions`), and wired it into `build/make-distribution.sh` for both the Maven and sbt build paths.\n- **Security note**: the standalone LM runs without authentication; it logs a warning on startup and the code documents that operators must bind it to a trusted network only.\n\n**2. Rust SDK via C++ FFI (`rust/` directory)**\n\n- `celeborn-client-sys`: Low-level FFI crate bridging Rust ↔ C++ via a **plain C ABI** (no `cxx`). 
The C++ side exposes `celeborn_ffi_*` functions (`create_client`, `setup_lifecycle_manager`, `shutdown`, `push_data`, `mapper_end`, partition reader open/read/close, etc.) returning status codes plus heap-allocated error strings.\n  - `build.rs`: links the single aggregated shared library `libceleborn_client.{so,dylib}` (which whole-archives all internal static libs and hides non-`celeborn` symbols), so downstream Rust never sees folly / protobuf / glog / abseil.\n- `celeborn-client`: Safe, ergonomic Rust wrapper providing `ShuffleClient` with:\n  - Input validation extracted into a pure `validate_connect_args` (app_id non-empty, port \u003e 0, codec ∈ {NONE, LZ4, ZSTD}) — testable without a live cluster.\n  - Documented `Send + Sync` rationale (the C++ `ShuffleClientImpl` synchronizes internally), enabling `Arc\u003cShuffleClient\u003e` sharing for concurrent `\u0026self` push/read.\n  - `Drop`-safe shutdown that nulls the handle to avoid a double `celeborn_ffi_shutdown`. The native handle is **intentionally leaked** after shutdown to dodge a folly `EventBase` teardown race; this implies a **per-process-client** usage model, which is documented prominently on the type.\n- Two example programs (`data_sum_writer.rs`, `data_sum_reader.rs`) mirroring the existing C++ `DataSumWithWriterClient` / `DataSumWithReaderClient` test programs. The writer seeds its RNG **per mapper thread** so each thread emits a distinct byte stream, genuinely exercising the concurrent push path.\n\n**3. 
C++ build portability (`cpp/CMakeLists.txt`)**\n\n- Discover the Homebrew prefix via `brew --prefix` instead of hard-coding `/opt/homebrew` (so Intel macOS under `/usr/local` works too).\n- Discover Abseil via `find_package(absl CONFIG)`, falling back to a toolchain-derived GNU multiarch dir (covers `aarch64-linux-gnu`, which the README advertises); a missing Abseil is now a `FATAL_ERROR` rather than a silent warning.\n- Guard the x86-only `-msse4.2` flag by architecture so aarch64 builds compile.\n\n### Why are the changes needed?\n\nCurrently, `LifecycleManager` can only run **embedded inside a JVM-based compute engine Driver** (e.g., Spark Driver). This makes it impossible for non-JVM applications (Daft engine, etc.) to use Celeborn as their shuffle service, because:\n\n1. The C++ client requires a running `LifecycleManager` to coordinate shuffle metadata (register shuffles, allocate slots, manage partition locations) with Celeborn Masters and Workers.\n2. Without a standalone `LifecycleManager`, non-JVM applications have no way to bootstrap this coordination layer.\n\nBy decoupling the `LifecycleManager` into a **standalone daemon process**, any client — regardless of language runtime — can connect to it via RPC. 
The Rust SDK then leverages this architecture to provide first-class Rust support by bridging to the existing, battle-tested C++ client implementation via FFI.\n\n### Does this PR resolve a correctness bug?\n\nNo\n\n### Does this PR introduce _any_ user-facing change?\n\nYes.\n\n- **New component**: Users can now start a standalone `LifecycleManager` daemon via `sbin/start-lifecycle-manager.sh --app-id \u003cid\u003e --master-endpoints \u003ceps\u003e [--port \u003cport\u003e] [--host \u003chost\u003e]`.\n- **New SDK**: Rust applications can now use the `celeborn-client` crate to perform shuffle read/write operations against a Celeborn cluster.\n- **Limitation**: The standalone `LifecycleManager` does **not** support auth (`celeborn.auth.enabled` must be `false`), as the C++/Rust clients lack SASL support. Deploy it on a trusted network only.\n\n### How was this patch tested?\n\n- **Unit tests (JVM)**: `LifecycleManagerDaemonArgumentsSuite` covers the happy paths plus help / unknown-arg / missing-arg / invalid-port branches and `applyArgsToConf` (13 tests). Run via Maven (`mvn -pl lifecycle-manager test`) and compiled/style-checked under sbt as a first-class module.\n- **Unit tests (Rust)**: `validate_connect_args` is covered by `cargo test -p celeborn-client` (codec/app_id/port validation) without requiring a cluster.\n- **Integration**: The Rust SDK was validated using the `data_sum_writer` / `data_sum_reader` examples (Rust ports of `DataSumWithWriterClient.cpp` / `DataSumWithReaderClient.cpp`), which write random numeric data across partitions and verify correctness by comparing partition sums between writer and reader. 
The `LifecycleManagerDaemon` was tested by starting it against a local Celeborn cluster (Master + Workers) and verifying the Rust examples connect, push, and read through the daemon.\n\nCloses #3677 from gavin9402/standalone_and_rust.\n\nLead-authored-by: 夷羿 \u003cyiyi.zt@alibaba-inc.com\u003e\nCo-authored-by: Zhou \u003cgavin9402@163.com\u003e\nSigned-off-by: Shuang \u003clvshuang.xjs@alibaba-inc.com\u003e\n"
    },
    {
      "commit": "b1a7cb4964f55a05544334befb0e49f78001f872",
      "tree": "5166bf1502fba8a73e9973967c2bed4210e8b9bc",
      "parents": [
        "4a9112a594ddc3f55a98172e643af1d08d72740d"
      ],
      "author": {
        "name": "James Xu",
        "email": "xumingmingv@gmail.com",
        "time": "Thu Jun 11 16:15:31 2026 +0800"
      },
      "committer": {
        "name": "Nicholas Jiang",
        "email": "programgeek@163.com",
        "time": "Thu Jun 11 16:15:31 2026 +0800"
      },
      "message": "[CELEBORN-2315][FOLLOWUP] Change assertIteratorFullyConsumed to throw CelebornIOException\n\n### What changes were proposed in this pull request?\n\nReplace TaskKilledException with CelebornIOException in assertIteratorFullyConsumed. CelebornIOException extends IOException and fits the existing throws IOException contract of all write() methods cleanly, without needing an unchecked exception workaround.\n\nAlso revert the throwTaskKillException(String message) overload added to TaskInterruptedHelper in CELEBORN-2315, which is now dead code.\n\n### Why are the changes needed?\n\n### Does this PR resolve a correctness bug?\n\n- [ ] Yes\n\n### Does this PR introduce _any_ user-facing change?\n\n- [ ] Yes\n\n### How was this patch tested?\n\nCloses #3728 from xumingming/celeborn-2315-followup-celebornioexception.\n\nAuthored-by: James Xu \u003cxumingmingv@gmail.com\u003e\nSigned-off-by: Nicholas Jiang \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "4a9112a594ddc3f55a98172e643af1d08d72740d",
      "tree": "75e8385b7dc6182982983f18561878476832985f",
      "parents": [
        "b131a400a239486fe36c659ed6b4397a27272a82"
      ],
      "author": {
        "name": "James Xu",
        "email": "xumingmingv@gmail.com",
        "time": "Thu Jun 11 14:30:07 2026 +0800"
      },
      "committer": {
        "name": "Nicholas Jiang",
        "email": "programgeek@163.com",
        "time": "Thu Jun 11 14:30:07 2026 +0800"
      },
      "message": "[CELEBORN-2315] Add iterator fully-consumed validation after shuffle write\n\n### What changes were proposed in this pull request?\n\nAdds a post-write safety check to HashBasedShuffleWriter and SortBasedShuffleWriter: after the write loop completes, verify the input iterator was fully consumed. If records remain, kill the task with TaskKilledException. This guards against silent data loss.\n\n### Why are the changes needed?\n\nIt could give another layer of correctness guarantee.\n\n### Does this PR resolve a correctness bug?\n\nEnhance correctness guarantee.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nUT.\n\nCloses #3672 from xumingming/iterator-fully-consumed-check.\n\nAuthored-by: James Xu \u003cxumingmingv@gmail.com\u003e\nSigned-off-by: Nicholas Jiang \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "b131a400a239486fe36c659ed6b4397a27272a82",
      "tree": "654ccf3ba5e5d9832fafd4f959a70fa3304dfa50",
      "parents": [
        "e5099f90d55e7b70292b0677c3a889f72c9b7efb"
      ],
      "author": {
        "name": "Sanskar Modi",
        "email": "sanskarmodi97@gmail.com",
        "time": "Tue Jun 09 03:31:17 2026 +0800"
      },
      "committer": {
        "name": "Nicholas Jiang",
        "email": "programgeek@163.com",
        "time": "Tue Jun 09 03:31:17 2026 +0800"
      },
      "message": "[MINOR] Fix PbSerDeUtilsTest failure\n\n### What changes were proposed in this pull request?\n\nFix PbSerDeUtilsTest failure for spark4 jobs\n\n```\nError:  /home/runner/work/celeborn/celeborn/common/src/test/scala/org/apache/celeborn/common/util/PbSerDeUtilsTest.scala:851: comparing values of types Boolean and Boolean using `equals` unsafely bypasses cooperative equality; use `\u003d\u003d` instead\nError: [ERROR] one error found\n```\nhttps://github.com/apache/celeborn/actions/runs/27126367363/job/80055898287?pr\u003d3720\n\n### Why are the changes needed?\n\nAfter this change https://github.com/apache/celeborn/pull/3675, some of tests are failing.\n\n### Does this PR resolve a correctness bug?\n\n- [ ] Yes\n\n### Does this PR introduce _any_ user-facing change?\n\n- [ ] Yes\n\n### How was this patch tested?\n\nExisting UTs.\n\nCloses #3722 from s0nskar/CELEBORN-1577_fix_test.\n\nAuthored-by: Sanskar Modi \u003csanskarmodi97@gmail.com\u003e\nSigned-off-by: Nicholas Jiang \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "e5099f90d55e7b70292b0677c3a889f72c9b7efb",
      "tree": "b07d9652d7b1b4817088d7c3eb63709dd5d0585c",
      "parents": [
        "a0f098f75eff99ae4b914678c5eeacc089223889"
      ],
      "author": {
        "name": "James Xu",
        "email": "xumingmingv@gmail.com",
        "time": "Tue Jun 09 03:18:57 2026 +0800"
      },
      "committer": {
        "name": "Nicholas Jiang",
        "email": "programgeek@163.com",
        "time": "Tue Jun 09 03:18:57 2026 +0800"
      },
      "message": "[CELEBORN-2313] Extend E2E checked zone to batch assembly point\n\n### What changes were proposed in this pull request?\n\nCeleborn\u0027s E2E integrity check computes CRC_M inside `ShuffleClientImpl.pushOrMergeData()`, which runs in the async `DataPusher` thread. This leaves the segment from batch assembly in the writer thread through the `DataPusher` queue entirely outside the checked zone — meaning any corruption that occurs in that window is invisible to the integrity check and reaches reducers silently.\n\nThis change closes that gap and enables detection of a class of correctness bugs where data corruption occurs between batch assembly and async push dispatch, including bugs involving shared buffer pool references.\n\nIntroduce `ShuffleClient.computeBatchCRC()` and consolidate its invocation into two choke points:\n\n- `DataPusher.addTask()`: covers all async push paths. The CRC is recorded on the writer thread immediately before the buffer is enqueued, so `DataPusher.pushData()` intentionally uses the bare `client.pushData()` to avoid double-counting the same batch into CommitMetadata.\n\n- `ShuffleClient.pushDataWithCRC()` and `ShuffleClient.mergeDataWithCRC()`: new concrete convenience methods that call `computeBatchCRC()` then delegate to the abstract `pushData()`/`mergeData()`. These cover all synchronous push paths (`pushGiantRecord`, `close()` flush). 
The abstract `pushData()`/`mergeData()` are now documented as internal-use-only; all writer call sites across `HashBasedShuffleWriter` (spark-2/3), `SortBasedShuffleWriter` (spark-2/3), and `SortBasedPusher` use the `WithCRC` variants instead.\n\nThe now-redundant CRC computation inside `pushOrMergeData()` is removed.\n\nThis consolidation eliminates 7 scattered `computeBatchCRC` call sites that previously had to be manually paired with each push/merge call, reducing the risk of a future call site omitting the CRC step.\n\n### Why are the changes needed?\n\nEnhance E2E Integrity Check, so it can cover more code path.\n\n### Does this PR resolve a correctness bug?\n\n- [ ] Yes\n\n### Does this PR introduce _any_ user-facing change?\n\n- [ ] Yes\n\n### How was this patch tested?\n\nUnit Test.\n\nCloses #3716 from xumingming/extend-e2e-checked-zone-v2.\n\nAuthored-by: James Xu \u003cxumingmingv@gmail.com\u003e\nSigned-off-by: Nicholas Jiang \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "a0f098f75eff99ae4b914678c5eeacc089223889",
      "tree": "0fe8b39856689adfde2af291490d8a53b7474178",
      "parents": [
        "93de9b3042f419fa914144cd640e18e8995a3b8b"
      ],
      "author": {
        "name": "Sanskar Modi",
        "email": "sanskarmodi97@gmail.com",
        "time": "Mon Jun 08 14:25:00 2026 +0800"
      },
      "committer": {
        "name": "Shuang",
        "email": "lvshuang.xjs@alibaba-inc.com",
        "time": "Mon Jun 08 14:25:00 2026 +0800"
      },
      "message": "[CELEBORN-2347] Change get reducer file group cache to expireAfterAccess\n\n### What changes were proposed in this pull request?\n\nChange get reducer file group cache to `expireAfterAccess`.\n\n### Why are the changes needed?\n\nCurrently the policy is `expireAfterWrite`, which is not efficient: it strictly clears an entry after the timeout without considering whether that entry is hot. `expireAfterAccess` clears an entry only if it has not been accessed within the timeout.\n\n### Does this PR resolve a correctness bug?\n\n- [ ] Yes\n\n### Does this PR introduce _any_ user-facing change?\n\n- [ ] Yes\n\n### How was this patch tested?\n\nExisting UTs.\n\nCloses #3717 from s0nskar/cache_policy.\n\nAuthored-by: Sanskar Modi \u003csanskarmodi97@gmail.com\u003e\nSigned-off-by: Shuang \u003clvshuang.xjs@alibaba-inc.com\u003e\n"
    },
    {
      "commit": "93de9b3042f419fa914144cd640e18e8995a3b8b",
      "tree": "643e58bbd6306bad39a633eeb87f29d0fd0ed975",
      "parents": [
        "b10546448657d430cc33d8607012a34135217498"
      ],
      "author": {
        "name": "afterincomparableyum",
        "email": "224495379+afterincomparableyum@users.noreply.github.com",
        "time": "Mon Jun 08 14:03:26 2026 +0800"
      },
      "committer": {
        "name": "Nicholas Jiang",
        "email": "programgeek@163.com",
        "time": "Mon Jun 08 14:14:15 2026 +0800"
      },
      "message": "[CELEBORN-2332] Fix self join deadlock in C++ WorkerPartitionReader fetch callbacks\n\n### What changes were proposed in this pull request?\n\npush-merged data, which exercises this fetch path more aggressively and reliably triggered an EDEADLK abort.\n\nThe onSuccess_/onFailure_ callbacks are invoked on the TransportClient\u0027s IO thread and capture a weak_ptr that is lifted to a shared_ptr inside the callback body. When that local shared_ptr happens to hold the last reference, dropping it inline runs WorkerPartitionReader on the IO executor\u0027s own thread, which transitively destroys the embedded TransportClient and its IOThreadPoolExecutor. The executor then attempts to pthread_join the thread that is currently executing the callback and the join fails with EDEADLK, aborting the process.\n\nHand the final reference off to the global CPU executor so destruction of the reader (and the IO executor underneath it) always happens on a different thread than the one running the callback.\n\n### Why are the changes needed?\n\nWhen running the bytedance bolt celeborn e2e tests for push-merged data I am working on, I run into this error.\n\n### Does this PR resolve a correctness bug?\n\nYes\n\n### Does this PR introduce _any_ user-facing change?\n\nNo\n\n### How was this patch tested?\n\nI ran the Celeborn bolt e2e tests with this change and the tests passed with push merged data support.\n\nCloses #3693 from afterincomparableyum/celeborn-2332.\n\nAuthored-by: afterincomparableyum \u003c224495379+afterincomparableyum@users.noreply.github.com\u003e\nSigned-off-by: Nicholas Jiang \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "b10546448657d430cc33d8607012a34135217498",
      "tree": "24139c7ea5c3cbf105f687043f2e35e5a6968a58",
      "parents": [
        "bd2523fa5900e90063cdf3916aa5e9e9d48a49e7"
      ],
      "author": {
        "name": "Nicholas Jiang",
        "email": "programgeek@163.com",
        "time": "Fri Jun 05 10:42:12 2026 +0800"
      },
      "committer": {
        "name": "Nicholas Jiang",
        "email": "programgeek@163.com",
        "time": "Fri Jun 05 10:42:12 2026 +0800"
      },
      "message": "[CELEBORN-2348] Support end-to-end shuffle integrity check for Flink\n\n### What changes were proposed in this pull request?\n\nThis PR extends the end-to-end shuffle integrity checks introduced in CELEBORN-894 (Spark-only) to Flink workloads, covering both the regular and the tiered (hybrid) read paths. When the check is enabled, the write side records a per subpartition CRC32 + byte count and the driver validates it against what the reader actually consumed, failing the read on a mismatch.\n\n- **Write side**: `FlinkShuffleClientImpl` hashes each push payload (the body after the batch header) into `PushState` via a zero-copy `ByteBuffer` view and reports the per-subpartition CRC32/bytes at `MapperEnd`, reusing the existing `crc32PerPartition` / `bytesWrittenPerPartition` plumbing. The constructor fails fast if the write-side `BATCH_HEADER_SIZE` ever diverges from the read-side `BufferUtils.HEADER_LENGTH_PREFIX`.\n- **Read side**: `RemoteBufferStreamReader` and `CelebornChannelBufferReader` accumulate the read CRC32/bytes through a shared `ReadIntegrityTracker` and report them at the last partition\u0027s stream end. The tracker owns the per-path framing/stripping and disables itself on any unexpected buffer shape (wrong component count, or a buffer shorter than the batch header) rather than risk a false mismatch.\n- **Driver side**: `ReadReducerPartitionEnd` is reused for MAP partitions, and `MapPartitionCommitHandler.finishPartition` combines the recorded write-side checksums over the consumed subpartition range, failing closed on a mismatch or missing metadata.\n- Add zero-copy `ByteBuffer` overloads to `CelebornCRC32`, `CommitMetadata` and `PushState`.\n- Minor: drop a stray (cosmetic, no-op) unary plus in `handleReducerPartitionEnd`\u0027s failure branch and update the client config doc.\n\n### Why are the changes needed?\n\nCELEBORN-894 added end-to-end integrity verification only for Spark. 
Flink workloads — including hybrid/tiered shuffle — had no equivalent guard, so silent shuffle data corruption (bit flips, truncation, mis-framing) could go undetected and surface as wrong results rather than a failed task. This PR brings the same write-vs-read checksum/byte-count validation to Flink so such corruption fails the read instead of being silently consumed.\n\n### Does this PR resolve a correctness bug?\n\n- [ ] Yes\n\n### Does this PR introduce _any_ user-facing change?\n\n- [x] Yes\n\nThe existing `celeborn.client.shuffle.integrityCheck.enabled` config now also applies to Flink (previously Spark-only); its documentation is updated accordingly. The default remains `false`, so there is no behavior change unless the check is explicitly enabled.\n\n### How was this patch tested?\n\nAdded unit and integration tests:\n\n- `CelebornCRC32Test` / `CommitMetadataTest`: the new `ByteBuffer` overloads (single and split header/data), order-independence, and corruption / byte-count-mismatch detection.\n- `MapPartitionCommitHandlerTest`: `finishPartition` success and all failure branches (no metadata, missing map partition, out-of-bounds range, checksum mismatch, byte-count mismatch), concurrent recording, and the expired-shuffle race.\n- `ReadIntegrityTrackerTest`: report-once / disable semantics and per-path framing for both the regular and tiered read paths.\n- `RemoteBufferStreamReaderTest`: the stream-end-after-close race (a failed report must not notify the failure listener on a closed channel).\n- `CelebornBufferStreamTest`: the `hasRemainingPartitions` location-index boundary.\n- `WordCountTest` (`WordCountTestWithIntegrityCheck`) and `HybridShuffleWordCountTest`: end-to-end Flink runs with the check enabled, on both the regular and hybrid shuffle paths.\n\nCloses #3718 from SteNicholas/CELEBORN-2348.\n\nAuthored-by: Nicholas Jiang \u003cprogramgeek@163.com\u003e\nSigned-off-by: Nicholas Jiang \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "bd2523fa5900e90063cdf3916aa5e9e9d48a49e7",
      "tree": "9b0eaa5e3d675c701ba42be8e5fd2d909cffcf12",
      "parents": [
        "bf0a1e8fc281d622fdf176076706d4dc4a52d80e"
      ],
      "author": {
        "name": "Sanskar Modi",
        "email": "sanskarmodi97@gmail.com",
        "time": "Fri Jun 05 10:39:25 2026 +0800"
      },
      "committer": {
        "name": "Nicholas Jiang",
        "email": "programgeek@163.com",
        "time": "Fri Jun 05 10:39:25 2026 +0800"
      },
      "message": "[CELEBORN-1577][FOLLOWUP] Fix backward compatibility issue with interrupt shuffle\n\n### What changes were proposed in this pull request?\n\nFix backward compatibility issue with interrupt shuffle by checking if the reason is nonEmpty.\n\n### Why are the changes needed?\n\nIf someone uses a new client with an old server that does not send `CheckQuotaResponse` in `HeartbeatFromApplicationResponse`, proto uses default values to build a CheckQuotaResponse with isAvailable\u003dfalse and reason\u003d\"\". In this case the job always fails without breaching the quota. To stay backward compatible, we should not fail the job if the reason is empty.\n\n### Does this PR resolve a correctness bug?\n\nNo\n\n### Does this PR introduce _any_ user-facing change?\n\nNo\n\n### How was this patch tested?\n\nTested in local setup.\n\nCloses #3675 from s0nskar/fix_interrupt.\n\nAuthored-by: Sanskar Modi \u003csanskarmodi97@gmail.com\u003e\nSigned-off-by: Nicholas Jiang \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "bf0a1e8fc281d622fdf176076706d4dc4a52d80e",
      "tree": "e8e42a80ea628463af9dc87c9325c5602e9fe150",
      "parents": [
        "221129bb858f14743412a1d7a25886540333d94e"
      ],
      "author": {
        "name": "Chao Sun",
        "email": "chao@openai.com",
        "time": "Wed Jun 03 17:38:19 2026 +0800"
      },
      "committer": {
        "name": "Nicholas Jiang",
        "email": "programgeek@163.com",
        "time": "Wed Jun 03 17:38:19 2026 +0800"
      },
      "message": "[CELEBORN-2346] Add RequestSlots failure metrics\n\n### What changes were proposed in this pull request?\n\nExport a new master `RequestSlotsFailed` counter with a bounded `status` label for `SLOT_NOT_AVAILABLE` and `WORKER_EXCLUDED`.\n\nThis patch also makes metric-specific labels override configured `celeborn.metrics.extraLabels`, while preserving reserved `role` and `instance` labels. Configured extra labels are snapshotted when a metrics source is constructed so labeled metric keys remain stable.\n\n### Why are the changes needed?\n\n`RequestSlots` placement failures directly affect shuffle registration, but existing master metrics do not expose whether applications are receiving these failure responses. The new counter provides a direct monitoring signal.\n\nWithout the label precedence change, an extra label such as `status\u003dprod` shadows the metric-specific failure reason and collapses both series.\n\n### Does this PR resolve a correctness bug?\n\n- [ ] Yes\n\n### Does this PR introduce _any_ user-facing change?\n\n- [x] Yes\n\n### How was this patch tested?\n\n- `JAVA_HOME\u003d$(brew --prefix openjdk17) dev/reformat`\n- `JAVA_HOME\u003d$(brew --prefix openjdk17) build/mvn --no-transfer-progress -DskipTests -DprotocPluginExecutable\u003d/tmp/protoc-gen-grpc-java-noop-proto3-optional -pl master -am install`\n- `JAVA_HOME\u003d$(brew --prefix openjdk17) build/mvn --no-transfer-progress -DprotocPluginExecutable\u003d/tmp/protoc-gen-grpc-java-noop-proto3-optional -Dsuites\u003dorg.apache.celeborn.common.metrics.source.CelebornSourceSuite -pl common test-compile scalatest:test`\n- `JAVA_HOME\u003d$(brew --prefix openjdk17) build/mvn --no-transfer-progress -DprotocPluginExecutable\u003d/tmp/protoc-gen-grpc-java-noop-proto3-optional -Dsuites\u003dorg.apache.celeborn.service.deploy.master.MasterSourceSuite,org.apache.celeborn.service.deploy.master.MasterSuite -pl master test-compile scalatest:test`\n\nThe local `protocPluginExecutable` 
override is an ARM workstation workaround for the downloaded `protoc-gen-grpc-java` artifact.\n\nCloses #3714 from sunchao/CELEBORN-2346-request-slots-failure-metrics.\n\nAuthored-by: Chao Sun \u003cchao@openai.com\u003e\nSigned-off-by: Nicholas Jiang \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "221129bb858f14743412a1d7a25886540333d94e",
      "tree": "0b6de78cd91c785d74e9ec84077961c04b88cc87",
      "parents": [
        "c3bee9c5aaf0d05cc9f775401bf731f0d3ad72bb"
      ],
      "author": {
        "name": "Sanskar Modi",
        "email": "sanskarmodi97@gmail.com",
        "time": "Wed Jun 03 10:48:22 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed Jun 03 10:48:22 2026 +0800"
      },
      "message": "[CELEBORN-2343] Fix timer leak in handlePushData\n\n### What changes were proposed in this pull request?\n\nFix the timer leak in handlePushData\n\n### Why are the changes needed?\n\nTo avoid timer leak and publish correct `PRIMARY_PUSH_DATA_TIME` and `REPLICA_PUSH_DATA_TIME` metrics\n\n### Does this PR resolve a correctness bug?\n\n- [ ] Yes\n\n### Does this PR introduce _any_ user-facing change?\n\n- [ ] Yes\n\n### How was this patch tested?\n\nNA\n\nCloses #3709 from s0nskar/fix_callback.\n\nAuthored-by: Sanskar Modi \u003csanskarmodi97@gmail.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "c3bee9c5aaf0d05cc9f775401bf731f0d3ad72bb",
      "tree": "61275f16b88af26547dcbb846d3c006e1c5e3f32",
      "parents": [
        "c79937379383b305d43d275eb1e2b9e73d9ab439"
      ],
      "author": {
        "name": "Sanskar Modi",
        "email": "sanskarmodi97@gmail.com",
        "time": "Tue Jun 02 19:57:39 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Tue Jun 02 19:57:39 2026 +0800"
      },
      "message": "[CELEBORN-2310][FOLLOWUP] Account actualUsableSpace while excluding workers\n\n### What changes were proposed in this pull request?\n\nAccount for actualUsableSpace while excluding workers. We should use `DeviceInfo.isHealthy()`, which actually accounts for the available actualUsableSpace.\n\n### Why are the changes needed?\n\nCurrently we only check the DiskStatus, which may not be set if the device monitor is not enabled.\n\n### Does this PR resolve a correctness bug?\n\n- [ ] Yes\n\n### Does this PR introduce _any_ user-facing change?\n\n- [ ] Yes\n\n### How was this patch tested?\n\nMinor Change.\n\nCloses #3715 from s0nskar/disk_health.\n\nAuthored-by: Sanskar Modi \u003csanskarmodi97@gmail.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "c79937379383b305d43d275eb1e2b9e73d9ab439",
      "tree": "8d750a203eaebdbfc3e4cdef9cad7b7ce58cbbe7",
      "parents": [
        "2497739705629efd455a2e90f73f4e3a93ca0871"
      ],
      "author": {
        "name": "afterincomparableyum",
        "email": "224495379+afterincomparableyum@users.noreply.github.com",
        "time": "Tue Jun 02 19:47:04 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Tue Jun 02 19:47:04 2026 +0800"
      },
      "message": "[CELEBORN-2275][CIP-14] Add C++ merge-write and Java-read hybrid integration test\n\n### What changes were proposed in this pull request?\n\nAdd a new C++ test client that for the mergeData/pushMergedData write path and validates data integrity by reading back from the Java ShuffleClient. This complements the existing pushData based hybrid test by covering the merge write path.\n\n  - Add DataSumWithMergeWriterClient.cpp and its CMake build target\n  - Add CppMergeWriteJavaReadTest entry points for NONE, LZ4, and ZSTD compression codecs\n  - Add runCppMergeWriteJavaRead to JavaCppHybridReadWriteTestBase\n  - Update cpp_integration CI workflow to run the new tests\n\n### Why are the changes needed?\n\nThis is to add integration tests for https://github.com/apache/celeborn/pull/3611.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nTested through running unit tests and compiling locally.\n\nCloses #3619 from afterincomparableyum/cpp-client/celeborn-2275.\n\nLead-authored-by: afterincomparableyum \u003c224495379+afterincomparableyum@users.noreply.github.com\u003e\nCo-authored-by: afterincomparableyum \u003cafterincomparableyum\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "2497739705629efd455a2e90f73f4e3a93ca0871",
      "tree": "067670cec321bb36fffd1c53da6ab1c620ea772a",
      "parents": [
        "c546a21fbbddd559f28966cf0b82d48116b984f5"
      ],
      "author": {
        "name": "James Xu",
        "email": "xumingming@dewu.com",
        "time": "Mon Jun 01 15:00:30 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon Jun 01 15:00:30 2026 +0800"
      },
      "message": "[CELEBORN-2342] Fix object aliasing in LegacySkewHandlingPartitionValidator corrupting sub-range metadata\n\n### What changes were proposed in this pull request?\n\nWhen an AQE-skewed partition is split into N sub-ranges, the first sub-range\u0027s CommitMetadata object was stored by reference in both subRangeToCommitMetadataMap and currentCommitMetadataForReducer. Each subsequent sibling RPC mutated that object in-place via addCommitData(), silently inflating the TreeMap entry from bytes(A) to bytes(A)+bytes(siblings). Any task retry then sent the correct bytes(A) but found the inflated value, causing a permanent CelebornIOException mismatch and job abort.\n\n### Why are the changes needed?\n\nThis is a bug of E2E Integrity Check.\n\n### Does this PR resolve a correctness bug?\n\n- [ ] Yes\n\n### Does this PR introduce _any_ user-facing change?\n\n- [ ] Yes\n\n### How was this patch tested?\n\nAdded Unit Test.\n\nCloses #3708 from xumingming/fix/legacy-skew-validator-object-aliasing.\n\nAuthored-by: James Xu \u003cxumingming@dewu.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "c546a21fbbddd559f28966cf0b82d48116b984f5",
      "tree": "752190003f57a1e71b08db366f740e656a561c01",
      "parents": [
        "2d5098b5ef9f864a0b3aafcd8e06be3b4faf42b0"
      ],
      "author": {
        "name": "Sanskar Modi",
        "email": "sanskarmodi97@gmail.com",
        "time": "Mon Jun 01 14:55:05 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon Jun 01 14:55:05 2026 +0800"
      },
      "message": "[CELEBORN-2345] Fix allocation for rpcAskTimeout\n\n### What changes were proposed in this pull request?\n\n- Reuse the `rpcAskTimeout` variable in LifecycleManager\n- Make  `rpcAskTimeout` and `rpcRetryWait` lazy in `RpcEndpointRef`. Since many places are using `RpcEndpointRef` for just endpoint name. Examples are all the callers using `Dispatcher.postMessage`\n\n### Why are the changes needed?\n\n`rpcAskTimeout` and `rpcRetryWait` are causing 4% of total allocations.\n\n\u003cimg width\u003d\"644\" height\u003d\"379\" alt\u003d\"Screenshot 2026-05-29 at 3 23 53 PM\" src\u003d\"https://github.com/user-attachments/assets/d9d3d858-b56a-42c7-bdfb-cbf93cb18d07\" /\u003e\n\n### Does this PR resolve a correctness bug?\n\n- [ ] Yes\n\n### Does this PR introduce _any_ user-facing change?\n\n- [ ] Yes\n\n### How was this patch tested?\n\nMinor change\n\nCloses #3712 from s0nskar/fix_conf_allocation.\n\nAuthored-by: Sanskar Modi \u003csanskarmodi97@gmail.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "2d5098b5ef9f864a0b3aafcd8e06be3b4faf42b0",
      "tree": "d62b3938aacc6c2cda96a29237174eb7d2358164",
      "parents": [
        "cf8d472718a597fa254bc4db5c38ec65abbdeaaf"
      ],
      "author": {
        "name": "Cheng Pan",
        "email": "chengpan@apache.org",
        "time": "Wed May 27 12:58:21 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed May 27 12:58:21 2026 +0800"
      },
      "message": "[CELEBORN-2339] Add tools.jar into classpath only for Java 8\n\n### What changes were proposed in this pull request?\n\nThis is an enhancement of CELEBORN-1682, limit the `tools.jar` injection only for Java 8.\n\n### Why are the changes needed?\n\nhttps://docs.oracle.com/en/java/javase/17/migrate/migrating-jdk-8-later-jdk-releases.html\n\n\u003e Class and resource files previously stored in lib/rt.jar, lib/tools.jar, lib/dt.jar and various other internal JAR files are stored in a more efficient format in implementation-specific files in the lib directory.\n\n### Does this PR resolve a correctness bug?\n\n- [ ] Yes\n\n### Does this PR introduce _any_ user-facing change?\n\n- [ ] Yes\n\n### How was this patch tested?\n\nEnsure the warning has gone on JDK 17.\n\n```\n\"WARNING: cannot locate tools.jar. Expected to find it in either /opt/openjdk-17/lib/tools.jar or /opt/openjdk-17/../lib/tools.jar\"\n```\n\nCloses #3703 from pan3793/CELEBORN-2339.\n\nAuthored-by: Cheng Pan \u003cchengpan@apache.org\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "cf8d472718a597fa254bc4db5c38ec65abbdeaaf",
      "tree": "42f72619479af923854dce8165666aebe39920c4",
      "parents": [
        "dfe7def07fabc4fca6fc786ab45edb02076929eb"
      ],
      "author": {
        "name": "Cheng Pan",
        "email": "chengpan@apache.org",
        "time": "Tue May 26 20:17:38 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed May 27 10:11:15 2026 +0800"
      },
      "message": "[CELEBORN-2338] Remove hardcoded CELEBORN_PRINT_LAUNCH_COMMAND\u003d0\n\n### What changes were proposed in this pull request?\n\nRemove hardcoded `CELEBORN_PRINT_LAUNCH_COMMAND\u003d0` in `bin/celeborn-class`\n\n### Why are the changes needed?\n\nIt should be picked from the environment variable.\n\n### Does this PR resolve a correctness bug?\n\n- [ ] Yes\n\n### Does this PR introduce _any_ user-facing change?\n\n- [ ] Yes\n\n### How was this patch tested?\n\n```\n$ CELEBORN_PRINT_LAUNCH_COMMAND\u003d1 sbin/celeborn-cli --version\n...\nStart to launch /opt/java/openjdk/bin/java -XX:+IgnoreUnrecognizedVMOptions -cp /opt/celeborn/conf::/opt/celeborn/cli-jars/*: org.apache.celeborn.cli.CelebornCli --version\n...\n```\n\nCloses #3702 from pan3793/CELEBORN-2338.\n\nAuthored-by: Cheng Pan \u003cchengpan@apache.org\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "dfe7def07fabc4fca6fc786ab45edb02076929eb",
      "tree": "3d45fae17a9e967316594acb00cb17edb2335e4f",
      "parents": [
        "759e7b547641139576f21ddecbf1db5b42f83b01"
      ],
      "author": {
        "name": "Cheng Pan",
        "email": "chengpan@apache.org",
        "time": "Tue May 26 20:14:54 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Tue May 26 20:14:54 2026 +0800"
      },
      "message": "[CELEBORN-2337] Celeborn OpenAPI client should not shade slf4j-api\n\n### What changes were proposed in this pull request?\n\nAs the title, it\u0027s a packaging change.\n\n### Why are the changes needed?\n\nI found that `celeborn-cli` always prints such warnings, but actually the slf4j-api and log4j2 jars are correctly present in classpath.\n\n```\nSLF4J: Failed to load class \"org.slf4j.impl.StaticLoggerBinder\".\nSLF4J: Defaulting to no-operation (NOP) logger implementation\nSLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.\n```\n\nafter some investigation, I found that `celeborn-openapi-client-*.jar` bundles shaded slf4j classes, which causes the issue.\n\n```\n$ jar tf $CELEBORN_HOME/cli-jars/celeborn-openapi-client-*.jar | grep slf4j\n...\norg/apache/celeborn/shaded/org/slf4j/\norg/apache/celeborn/shaded/org/slf4j/ILoggerFactory.class\norg/apache/celeborn/shaded/org/slf4j/IMarkerFactory.class\norg/apache/celeborn/shaded/org/slf4j/Logger.class\n...\n```\n\n### Does this PR resolve a correctness bug?\n\n- [ ] Yes\n\n### Does this PR introduce _any_ user-facing change?\n\n- [ ] Yes\n\n### How was this patch tested?\n\nTested with `celeborn-cli`, `SLF4J` binding warnings have gone.\n\nCloses #3701 from pan3793/CELEBORN-2337.\n\nAuthored-by: Cheng Pan \u003cchengpan@apache.org\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "759e7b547641139576f21ddecbf1db5b42f83b01",
      "tree": "0050140045d14ca915fc0640af8f38da85672746",
      "parents": [
        "180db931beb410bcfab80d9abb3be9de9aaf2553"
      ],
      "author": {
        "name": "Chao Sun",
        "email": "chao@openai.com",
        "time": "Tue May 26 10:21:44 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Tue May 26 10:21:44 2026 +0800"
      },
      "message": "[CELEBORN-2331] Parallelize batch open stream client creation\n\n## Why are the changes needed?\n\n`CelebornShuffleReader` batches stream-open requests by worker, but it previously created the data client for each worker serially before sending those already-parallel batch requests. When a reducer reads from multiple workers, connection setup for a slow or unavailable worker can delay useful work against the remaining healthy workers.\n\nParallelizing this setup removes the worker-by-worker wait from the normal path. Because this changes task-side connection scheduling, the optimization also needs an operational fallback that restores the prior behavior without requiring a code rollback.\n\n## What changes were proposed in this PR?\n\nThe reader now first gathers pending stream-open locations by worker address, then creates one data client per distinct worker concurrently using the existing stream-creator pool. Once client setup completes, it sends the existing `BATCH_OPEN_STREAM` requests only for workers with an available client, allowing healthy workers to proceed even if another worker fails during setup.\n\nThe client-creation phase preserves the prior retry behavior for later locations on the same worker when an earlier client attempt fails. It also handles task cancellation explicitly: if the waiting Spark task is interrupted, it restores the interrupt status and cancels unfinished setup work; worker-side interruption is propagated rather than treated as an ordinary retryable failure.\n\nThis optimization is controlled by `celeborn.client.spark.batch.openStream.parallelClientCreation.enabled`, which defaults to `true`. 
Setting it to `false` selects the original serial client-creation and request-building flow, giving deployments a targeted rollback switch if parallel connection setup causes unexpected operational behavior.\n\n## How was this PR tested?\n\n- Unit tests for parallel client setup, failure/retry handling, cancellation on interruption, and the new configuration default and override.\n- Configuration documentation generation validation for the new client setting.\n- Spotless formatting validation.\n\nCloses #3692 from sunchao/dev/chao/codex/port-pr72-to-oss-main.\n\nAuthored-by: Chao Sun \u003cchao@openai.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "180db931beb410bcfab80d9abb3be9de9aaf2553",
      "tree": "c7cbb71f43b70597816692edd30d6d93f4eec33d",
      "parents": [
        "1c807d99a16e59fca0022b1524ec158bfa869514"
      ],
      "author": {
        "name": "Yuriy Malygin",
        "email": "yuriy@malyg.in",
        "time": "Tue May 26 09:58:05 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Tue May 26 09:58:05 2026 +0800"
      },
      "message": "[CELEBORN-2326] Bump log4j-core version up to 2.25.4\n\n### What changes were proposed in this pull request?\nBump Log4j2 version from 2.24.3 to 2.25.4\n\n### Why are the changes needed?\nApache Celeborn currently depends on Apache Log4j Core versions affected by CVE-2026-34480.\n\n### Does this PR resolve a correctness bug?\nNo.\n\n### Does this PR introduce any user-facing change?\nNo.\n\n### How was this patch tested?\nCI.\n\nCloses #3684 from yuriymalygin/patch-1.\n\nAuthored-by: Yuriy Malygin \u003cyuriy@malyg.in\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "1c807d99a16e59fca0022b1524ec158bfa869514",
      "tree": "bf41091c8233c969fcff354a9ca6e2f54a2700dd",
      "parents": [
        "91175cecbdbdc5e5b209ac6622f6fe081bc7957a"
      ],
      "author": {
        "name": "Chao Sun",
        "email": "chao@openai.com",
        "time": "Mon May 25 13:59:36 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon May 25 13:59:36 2026 +0800"
      },
      "message": "[CELEBORN-2330] Fix HA master bootstrap redirect handling\n\n## Why are the changes needed?\n\nIn HA mode, clients can hit a failover window where the master they contact is no longer the leader and returns a `MasterNotLeaderException` with a suggested leader address. The client-facing symptom is an RPC failure surfaced as:\n\n`CelebornException: Exception thrown in awaitResult`\n\nThe redirect signal is still present underneath that wrapper, but the existing bootstrap and retry logic does not consistently preserve and follow it. In particular:\n\n- bootstrap-time redirects can be treated like generic connection failures instead of explicit leader hints\n- suggested leaders can themselves redirect again or fail setup\n- after such failures, the client may not continue cleanly to the remaining configured masters\n\nThe consequence is that a client can fail to establish a master connection during HA leader transitions even when a reachable leader or another configured master is available. That turns a recoverable redirect/failover event into an avoidable client-visible failure and makes rolling upgrades or leader changes noisier than necessary.\n\n## What changes were proposed in this PR?\n\nThis port brings the HA redirect handling fix from `openai/celeborn#70` onto upstream `main`.\n\nWhen a master tells the client which leader to use, the client now keeps that redirect information even if it is wrapped inside another exception, and it actively follows the suggested leader during bootstrap and failover. If that suggested leader points to another leader, the client can continue along that redirect chain instead of giving up too early.\n\nIf the redirect path is no longer useful, for example because the suggested leader cannot be reached, no leader is currently presented, or redirects start looping, the client falls back to the remaining configured masters and keeps searching for a usable endpoint. 
The retry scan also advances correctly after endpoint setup fails, so one bad redirect does not prevent the client from trying the next viable master.\n\nThe PR also adds focused HA tests that cover the bootstrap redirect cases, chained redirects, redirect cycles, and fallback to configured masters.\n\n## How was this PR tested?\n\n- `build/mvn test -pl common -am -Dtest\u003dMasterClientSuiteJ -DwildcardSuites\u003dorg.apache.celeborn.common.client.__NoSuchSuite__`\n\nCloses #3691 from sunchao/dev/chao/codex/port-pr70-to-oss-main.\n\nAuthored-by: Chao Sun \u003cchao@openai.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "91175cecbdbdc5e5b209ac6622f6fe081bc7957a",
      "tree": "2502139503eb11aeae043448805f223e1d684fa4",
      "parents": [
        "671ef2566196ad090a64eebda563f2efc75fb666"
      ],
      "author": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon May 25 12:55:40 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon May 25 12:55:40 2026 +0800"
      },
      "message": "[CELEBORN-2335] Bump Spark from 4.1.1 to 4.1.2\n\n### What changes were proposed in this pull request?\n\nBump Spark from 4.1.1 to 4.1.2.\n\n### Why are the changes needed?\n\nSpark 4.1.2 has been announced to release: [Spark 4.1.2 released](https://spark.apache.org/news/spark-4-1-2-released.html). The profile spark-4.1 could bump Spark from 4.1.1 to 4.1.2.\n\n### Does this PR resolve a correctness bug?\n\n- [ ] Yes\n\n### Does this PR introduce _any_ user-facing change?\n\n- [ ] Yes\n\n### How was this patch tested?\n\nCI.\n\nCloses #3700 from SteNicholas/CELEBORN-2335.\n\nAuthored-by: SteNicholas \u003cprogramgeek@163.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "671ef2566196ad090a64eebda563f2efc75fb666",
      "tree": "849fa3c499e247100d27fd6d4f92782e64e4342a",
      "parents": [
        "56444d2ea58ed98b0e6bc8612e8c816fb2406e63"
      ],
      "author": {
        "name": "Fei Wang",
        "email": "fwang12@ebay.com",
        "time": "Tue May 19 11:48:00 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Tue May 19 11:48:00 2026 +0800"
      },
      "message": "[CELEBORN-2328] Auto-apply correctness label from PR template checkbox\n\n### What changes were proposed in this pull request?\n\nReplace the free-text `Yes/No` comment under \"Does this PR resolve a correctness bug?\" with a single checkbox in the PR template. Add a GitHub Actions workflow (`correctness-label.yml`) that automatically adds or removes the `correctness` label based on whether the box is checked, triggered on every PR open/edit.\n\n### Why are the changes needed?\n\nPreviously the note said \"committer will add `correctness` label\" — a manual step that was easy to miss. This automates it: checking the box applies the label immediately, and unchecking it removes the label, with no committer action required.\n\nTo track all correctness PR: https://github.com/apache/celeborn/issues?q\u003dlabel%3Acorrectness\n### Does this PR resolve a correctness bug?\n\n- [ ] Yes\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nCI workflow logic verified by code review. End-to-end behavior can be confirmed by opening a test PR against the repo and toggling the checkbox.\n\n\u003cimg width\u003d\"1796\" height\u003d\"144\" alt\u003d\"image\" src\u003d\"https://github.com/user-attachments/assets/ccd25ac5-1b24-4d8d-ab10-e6df406d2843\" /\u003e\n\nCloses #3688 from turboFei/correctness.\n\nAuthored-by: Fei Wang \u003cfwang12@ebay.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "56444d2ea58ed98b0e6bc8612e8c816fb2406e63",
      "tree": "e273d501f3bd4416bc2c42b0f85a21ef7dc6e698",
      "parents": [
        "8b36b3d25106c35d09c52a0e5e1b6e97c0ebd292"
      ],
      "author": {
        "name": "Chao Sun",
        "email": "chao@openai.com",
        "time": "Tue May 19 02:41:10 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Tue May 19 02:41:10 2026 +0800"
      },
      "message": "[CELEBORN-2327] Add active-slot weight to load-aware placement\n\n## Why are the changes needed?\n\nCeleborn load-aware slot placement currently orders candidate disks using flush and fetch timing only. That can still keep assigning new partitions onto disks that already carry a large amount of reserved active-slot pressure, which makes placement skew worse under overlapping shuffle-heavy workloads. CELEBORN-2327 tracks this gap.\n\n## What changes were proposed in this PR?\n\n- Add an optional, default-off `celeborn.master.slot.assign.loadAware.activeSlotsWeight` config.\n- Include `activeSlots * activeSlotsWeight` in load-aware disk ordering.\n- Thread the new config through the master allocation path.\n- Document the new tuning knob and update the slot-allocation developer docs.\n- Add a regression test showing that, when configured, lower-active-slot disks are preferred over otherwise equivalent disks.\n\n## How was this PR tested?\n\n- `UPDATE\u003d1 build/mvn clean test -pl common -am -Dtest\u003dnone -DwildcardSuites\u003dorg.apache.celeborn.ConfigurationSuite`\n- `./build/mvn -pl master -am -Dtest\u003dSlotsAllocatorSuiteJ -DwildcardSuites\u003dorg.apache.celeborn.NoSuchSuite -DfailIfNoTests\u003dfalse test`\n- `./build/mvn -pl master -am -DskipTests test-compile`\n\nCloses #3685 from sunchao/dev/chao/codex/active-slots-placement-oss.\n\nAuthored-by: Chao Sun \u003cchao@openai.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "8b36b3d25106c35d09c52a0e5e1b6e97c0ebd292",
      "tree": "1f253db3e16b6b5d1319872b8daf930767c09a1b",
      "parents": [
        "f83350f49b9c2e1456f196736da8275367dcc6ee"
      ],
      "author": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon May 18 14:58:54 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon May 18 14:58:54 2026 +0800"
      },
      "message": "[CELEBORN-2333] Bump Flink from 2.2.0 to 2.2.1\n\n### What changes were proposed in this pull request?\n\nBump Flink from 2.2.0 to 2.2.1\n\n### Why are the changes needed?\n\nFlink 2.2.1 has been announced to release: [Apache Flink 2.2.1 Release Announcement](https://flink.apache.org/2026/05/15/apache-flink-2.2.1-release-announcement/). The profile flink-2.2 could bump Flink from 2.2.0 to 2.2.1.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nCI.\n\nCloses #3694 from SteNicholas/CELEBORN-2333.\n\nAuthored-by: SteNicholas \u003cprogramgeek@163.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "f83350f49b9c2e1456f196736da8275367dcc6ee",
      "tree": "00d55e86641aee3809d07c45df2cdc7608f094fc",
      "parents": [
        "ee7529f83a1994c2d8f92a5d12166939b7cce895"
      ],
      "author": {
        "name": "The Apache Software Foundation",
        "email": "root-asf-gitbox-commits@apache.org",
        "time": "Mon May 18 11:20:46 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon May 18 11:20:46 2026 +0800"
      },
      "message": "[INFRA] Set up default rulesets for default and release branches\n\nThis Pull Request enables the repository to conform with the \"sane default security settings\" of the Apache Software Foundation by configuring a default branch ruleset that protects the default branch and any release branches.\n\nNote that `~DEFAULT_BRANCH` is a GitHub symbolic link to the current default branch (HEAD) of the repository and does not need changing.\nIf the managing project does not wish to set up these defaults, please close this Pull Request. Alternatively, the project may merge this Pull Request to apply the changes immediately.\n\nIf no action is taken, this Pull Request will be automatically merged by the Apache Infrastructure team on **2026-06-14** (30 days from now).\n\nFor any further information, please reach us on Slack or at: usersinfra.apache.org\n\nCloses #3690 from asf-gitbox-commits/infrastructure-ruleset-bot/default-branch-protection.\n\nAuthored-by: The Apache Software Foundation \u003croot-asf-gitbox-commits@apache.org\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "ee7529f83a1994c2d8f92a5d12166939b7cce895",
      "tree": "640acd9d0a9b008eb5c29837315e5ac64bc425af",
      "parents": [
        "71a7d0afa21b807f360a9c5e3d21e45258a6c441"
      ],
      "author": {
        "name": "Saurabh Dubey",
        "email": "saurabhd336@uber.com",
        "time": "Mon May 18 10:53:33 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon May 18 10:53:33 2026 +0800"
      },
      "message": "[CELEBORN-2310] Reject RESERVE_SLOTS when disks are full\n\n### What changes were proposed in this pull request?\n\nDisk full only lead to HARD_SPLITs as a response to writes. However, doesn\u0027t lead to reserve slot rejections. This means too many write retries (due to HARD_SPLITs on each write attempt) leads to wasted network I/O. We can reject RESERVE_SLOT during disk full to avoid the wasted data write network IO.\n\n### Why are the changes needed?\n\nReject reserve slots during disk full, avoid unnecessary network IO.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nAdded UTs, CI.\n\nCloses #3666 from saurabhd336/diskFullReserveSlotsRejection.\n\nAuthored-by: Saurabh Dubey \u003csaurabhd336@uber.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "71a7d0afa21b807f360a9c5e3d21e45258a6c441",
      "tree": "895974f951981830714017b60dd738fd27662c4e",
      "parents": [
        "9ebbc6b36ea94b1b665954d699d76e4711c3dd94"
      ],
      "author": {
        "name": "Filip Darmanovic",
        "email": "dzeri96@proton.me",
        "time": "Thu May 14 09:56:37 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Thu May 14 09:56:37 2026 +0800"
      },
      "message": "[CELEBORN-2257] Add reporting of remote disks during registration\n\n### What changes were proposed in this pull request?\n1. Disks reported to the master on registration now include remote disks (HDFS, S3, OSS)\n2. Refactored method names to clarify difference between local and remote disks.\n3. Embedded disk type information into the enum.\n4. Refactored unnecessarily complicated code in the slot assignment and worker registration path.\n\n### Why are the changes needed?\n1. Before the first heartbeat, the master won\u0027t be able to assign slots from the remote disks on the worker.\n2. All other changes are in preparation for better support of remote disks.\n\n### Does this PR resolve a correctness bug?\nNot a correctness bug\n\n### Does this PR introduce _any_ user-facing change?\nNo\n\n### How was this patch tested?\n**Important**: I want help from the community on how to write tests for this.\n\nCloses #3597 from Dzeri96/CELEBORN-2257.\n\nAuthored-by: Filip Darmanovic \u003cdzeri96@proton.me\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "9ebbc6b36ea94b1b665954d699d76e4711c3dd94",
      "tree": "92e33cb96c1f01129ef8d86970e1b403978c6e9a",
      "parents": [
        "50323e1f323c9432692fcc65bc703be107395288"
      ],
      "author": {
        "name": "pithecuse527",
        "email": "gihong96@gmail.com",
        "time": "Wed May 13 15:35:43 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed May 13 15:35:43 2026 +0800"
      },
      "message": "[CELEBORN-2324] Fix JVMQuake threshold and JVMStat timer unit conversion\n\n### What changes were proposed in this pull request?\n\nThis PR fixes JVMQuake time accounting by preserving threshold config values as milliseconds and converting JVMStat GC timer tick deltas to nanoseconds before updating the token bucket.\n\n### Why are the changes needed?\n\nJVMQuake thresholds were parsed as milliseconds but wrapped as microseconds, making values such as 60s behave like 60ms. JVMStat GC timer metrics are reported in ticks, so using them directly can misaccount GC time.\n\n### Does this PR resolve a correctness bug?\n\nYes\n\n### Does this PR introduce _any_ user-facing change?\nYes\n\n### How was this patch tested?\n\n1. UT - Added unit coverage for JVMQuake threshold parsing and JVMStat tick-to-nanosecond conversion.\n2. E2E -  Verified the patched image in a Kubernetes spark namespace with JVMQuake enabled using `dump.threshold\u003d30s`, `kill.threshold\u003d60s`, and `runtimeWeight\u003d0`. Under repeated GC, the worker stayed Ready with restart count 0, while the original image terminated early because the configured `60s` threshold was effectively interpreted as `60ms`.\n  Verified the kill path with a low-threshold configuration: `dump.threshold\u003d30ms`, `kill.threshold\u003d60ms`, `runtimeWeight\u003d0`, and `check.interval\u003d100ms`. Under GC activity, the worker logged bucket: `62995087` and killThreshold: `60000000`, exited via JVMQuake, and Kubernetes restarted the pod, increasing the restart count from 0 -\u003e 1.\n\nCloses #3682 from pithecuse527/CELEBORN-2324.\n\nAuthored-by: pithecuse527 \u003cgihong96@gmail.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "50323e1f323c9432692fcc65bc703be107395288",
      "tree": "6274e395fe2655a7f615551013cc296cb80c93c6",
      "parents": [
        "e329b16ff70b13419e862acf2dcd7bc05d829ab6"
      ],
      "author": {
        "name": "AmandeepSingh285",
        "email": "mailto.amandeep.singh.28@gmail.com",
        "time": "Wed May 13 14:47:38 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed May 13 14:47:38 2026 +0800"
      },
      "message": "[CELEBORN-2316] Introduce metadata operation metrics\n\n## What changes were proposed in this pull request?\n\nIntroduce metrics around RocksDB operations. Metrics to have success and failure count for metadata operations for RocksDB observability.\n\n### Why are the changes needed?\n\nIntroduce metrics around RocksDB metadata operations. Current implementation, metadata operations do not have any observability added. RocksDB goes into a read only mode when any critical errors are encountered which results in all write operations failing. Observability around metadata operations is critical and failures could result in metadata entering an inconsistent state.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nTested in local staging setup -\n\n\u003cimg width\u003d\"2166\" height\u003d\"314\" alt\u003d\"Screenshot 2026-05-11 at 3 06 05 PM\" src\u003d\"https://github.com/user-attachments/assets/f6f75515-f937-436d-9ee3-de6ca83bfcdf\" /\u003e\n\n\u003cimg width\u003d\"1113\" height\u003d\"362\" alt\u003d\"image\" src\u003d\"https://github.com/user-attachments/assets/89cc0b40-e96c-420e-a5aa-36849fb15fd4\" /\u003e\n\nCloses #3673 from AmandeepSingh285/adding-metadata-metrics.\n\nLead-authored-by: AmandeepSingh285 \u003cmailto.amandeep.singh.28@gmail.com\u003e\nCo-authored-by: amandeeps.28 \u003camandeeps.28@uber.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "e329b16ff70b13419e862acf2dcd7bc05d829ab6",
      "tree": "78a1b72959d3f8a5fb6a7fab2b27437ae4969f32",
      "parents": [
        "a70c8fddc456aeedcf7a5bb94d91fd9e23be278e"
      ],
      "author": {
        "name": "1fanwang",
        "email": "1fannnw@gmail.com",
        "time": "Wed May 13 14:45:53 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed May 13 14:45:53 2026 +0800"
      },
      "message": "[CELEBORN-2253] Fix IndexOutOfBoundsException reading shuffle data from HDFS\n\n### What changes were proposed in this pull request?\n\n`HdfsFlushTask.writeAndRecordMetrics` calls `hdfsStream.write(bytes)`, which writes the full `bytes.length`. When the provider passes a reusable `copyBytes` buffer (whose length is `\u003e\u003d size`), this leaks trailing bytes from previous flushes into the current partition file. Pass the actual readable size to write only `size` bytes.\n\n### Why are the changes needed?\n\nThe S3 and OSS flush paths had the same bug and were fixed in #3600 for CELEBORN-2263; the HDFS path was missed. Without the fix, shuffle data flushed to HDFS can be corrupted when `copyBytes` is reused across flushes, and readers later fail with `IndexOutOfBoundsException` in `CelebornInputStream.fillBuffer`, for example:\n\n```\nIndexOutOfBoundsException: readerIndex(4154253) + length(808530018)\n  exceeds writerIndex(12457470)\n```\n\n### Does this PR resolve a correctness bug?\n\nYes.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nNew unit test in `FlushTaskSuite` mirrors the S3/OSS coverage added in #3600. It drives `HdfsFlushTask.flush` with `copyBytes` arrays of three sizes (equal, larger, smaller than the buffer payload), captures the `FSDataOutputStream.write` arguments via Mockito\u0027s `ArgumentCaptor`, and asserts the offset/length pair matches the buffer content. The test fails on master with `ArgumentsAreDifferent` at `FlushTask.scala:128` and passes with the fix.\n\nCloses #3683 from 1fanwang/CELEBORN-2253-fix-hdfs-flush-trailing-bytes.\n\nAuthored-by: 1fanwang \u003c1fannnw@gmail.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "a70c8fddc456aeedcf7a5bb94d91fd9e23be278e",
      "tree": "6f815899a0f88fc3d2b0ff82bbf18a14a4eb3cb0",
      "parents": [
        "886e359d5e5a12698c1ed63f181f1f8a086c754b"
      ],
      "author": {
        "name": "Kartikay Bhutani",
        "email": "kbhutani0001@gmail.com",
        "time": "Tue May 12 10:19:34 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Tue May 12 10:19:34 2026 +0800"
      },
      "message": "[CELEBORN-2306] Master adds shutdown hook for ratis stepdown\n\n### What changes were proposed in this pull request?\n\n- Adds master shutdown hook\n- Updates RAFT stepdown to return bool\n- Calls RAFT stepdown on manager shutdown to do graceful stepdown\n\n### Why are the changes needed?\n\n- Manager.stop() isnt being called from anywhere, it emmits some logs as well for \"Stopping manager\" but since there is no shutdown hook defined, none of them are logged or the function is called at all\n- We faced a certain issue where the leader got removed from service mesh before shutdown and followers redirected to it because it was still running (able to send requests but not receive) for a brief period. This method adds a graceful shutdown option to do a RATIS stepdown before shutting down.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes, adds an additional config.\n\n### How was this patch tested?\n\nAdded a tests and validated that.\n\nCloses #3659 from kaybhutani/kartikay/graceful-master-shutdown.\n\nLead-authored-by: Kartikay Bhutani \u003ckbhutani0001@gmail.com\u003e\nCo-authored-by: Zaynt \u003cshuaizhentao@gmail.com\u003e\nCo-authored-by: kartikay \u003ckbhutani0001@gmail.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "886e359d5e5a12698c1ed63f181f1f8a086c754b",
      "tree": "cbcb40d79ba9a2a954fe614b2f6369d29b1f73ac",
      "parents": [
        "7340428ba0849fed00e0575efd5fc3c69e79badf"
      ],
      "author": {
        "name": "afterincomparableyum",
        "email": "224495379+afterincomparableyum@users.noreply.github.com",
        "time": "Mon May 11 17:48:20 2026 +0800"
      },
      "committer": {
        "name": "子懿",
        "email": "programgeek@163.com",
        "time": "Mon May 11 17:48:20 2026 +0800"
      },
      "message": "[CELEBORN-2314] Optimize the performance of DataBatches.requireBatches\n\n### What changes were proposed in this pull request?\n\n`requireBatches(int requestSize)` currently calls `batches.remove(0)` per iteration, which shifts all remaining elements each time, overall O(kn). Replacing with a two pass approach (find split point, then subList(0, count).clear()) reduces this to O(n).\n\n### Why are the changes needed?\n\nThis is a minor performance optimization.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo, this is just a performance improvement.\n\n### How was this patch tested?\n\nCI Unit/Integration tests.\n\nCloses #3671 from afterincomparableyum/celeborn-2314.\n\nAuthored-by: afterincomparableyum \u003c224495379+afterincomparableyum@users.noreply.github.com\u003e\nSigned-off-by: 子懿 \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "7340428ba0849fed00e0575efd5fc3c69e79badf",
      "tree": "1c8ed4ff7b2ea707f342e6b1f9d2ba90e95de4e3",
      "parents": [
        "70ff956d049ab071d6ff93ff55acc4fea0a46635"
      ],
      "author": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon May 11 09:41:50 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon May 11 09:41:50 2026 +0800"
      },
      "message": "[CELEBORN-2322] Upgrade version of docker/login-action for Login to Docker Hub\n\n### What changes were proposed in this pull request?\n\nUpgrade version of docker/login-action for Login to Docker Hub to `docker/login-action4907a6ddec9925e35a0a9e82d7399ccc52663121`.\n\n### Why are the changes needed?\n\nThere is error of dockerhub login in https://github.com/apache/celeborn/actions/runs/24821432736, which is as follows:\n\n```\nThe action docker/login-actionv3 is not allowed in apache/celeborn because all actions must be from a repository owned by your enterprise, created by GitHub, or match one of the patterns: 1Password/load-secrets-action13f58eec611f8e5db52ec16247f58c508398f3e6, 1Password/load-secrets-action8d0d610af187e78a2772c2d18d627f4c52d3fbfb, 1Password/load-secrets-action92467eb28f72e8255933372f1e0707c567ce2259, 1Password/load-secrets-actiondafbe7cb03502b260e2b2893c753c352eee545bf, AdoptOpenJDK/install-jdk*, BobAnkh/auto-generate-changelog*, DavidAnson/markdownlint-cli2-action07035fd053f7be764496c0f8d8f9f41f98305101, DavidAnson/markdownlint-cli2-actionce4853d43830c74c1753b39f3cf40f71c2031eb9, EnricoMi/publish-unit-test-result-action*, JamesIves/github-pages-deploy-action4a3abc783e1a24aeb44c16e869ad83caf6b4cc23, JamesIves/github-pages-deploy-actiond92aa235d04922e8f08b40ce78cc5442fcfbfa2f, JetBrains/qodana-action89eb4357efd2b52e639f3216e63edaf33b82622b, Jimver/cuda-toolkit3d45d157f327c...\n```\n\n[INFRA-27901](https://issues.apache.org/jira/projects/INFRA/issues/INFRA-27901) gives the following suggestion:\n\n\u003e Following the Trivy compromise, more controls have been put in place regarding use of third party actions.\n\u003e\n\u003e The only allowed versions of this action are those in the repo:\n\u003e\n\u003e https://github.com/apache/infrastructure-actions\n\u003e\n\u003e In: https://raw.githubusercontent.com/apache/infrastructure-actions/refs/heads/main/actions.yml at the moment you can use :\n\u003e\n\u003e - 
docker/login-actionc94ce9fb468520275223c153574b00df6fe4bcc9\n\u003e - docker/login-actionb45d80f862d83dbcd57f89517bcf500b2ab88fb2\n\u003e - docker/login-action4907a6ddec9925e35a0a9e82d7399ccc52663121\n\u003e\n\u003e which correspond to these tagged versions:\n\u003e\n\u003e docker/login-action:\n\u003e   c94ce9fb468520275223c153574b00df6fe4bcc9:\n\u003e     tag: v3.7.0\n\u003e     expires_at: 2026-06-14\n\u003e   b45d80f862d83dbcd57f89517bcf500b2ab88fb2:\n\u003e     tag: v4.0.0\n\u003e     expires_at: 2026-07-05\n\u003e   4907a6ddec9925e35a0a9e82d7399ccc52663121:\n\u003e     tag: v4.1.0\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nNo.\n\nCloses #3681 from SteNicholas/CELEBORN-2322.\n\nAuthored-by: SteNicholas \u003cprogramgeek@163.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "70ff956d049ab071d6ff93ff55acc4fea0a46635",
      "tree": "4ec39101e0bff5d72cf9181672fc6a778a04dfac",
      "parents": [
        "8d473c5af5a754a5b0329a805298ce9e8f0d27e7"
      ],
      "author": {
        "name": "Chao Sun",
        "email": "chao@openai.com",
        "time": "Sun May 10 12:35:50 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Sun May 10 12:35:50 2026 +0800"
      },
      "message": "[CELEBORN-2321] Avoid locking disk writers during memory split checks\n\n### Why are the changes needed?\n\n`needHardSplitForMemoryShuffleStorage()` runs on the push path. Disk-backed writers can never require this memory-only split check, but the method currently acquires the writer lock before returning `false`. For the common disk-backed case, that adds avoidable contention with writes and evictions on a hot path.\n\n### What changes were proposed in this PR?\n\nThis PR adds an unlocked fast path for non-memory writers so they return immediately without taking the `PartitionDataWriter` monitor. For memory-backed writers, it rechecks `currentTierWriter` after entering the synchronized block before evaluating the existing hard-split conditions, which preserves the original behavior if the writer tier changes concurrently.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\n- Attempted `build/mvn -pl worker -am -DskipTests compile` on current `main`.\n- The Maven reactor fails before reaching `worker` because `celeborn-master_2.12` cannot resolve snapshot test-jar artifacts for `celeborn-common_2.12` and `celeborn-service_2.12`; that failure is unrelated to this change.\n\nCloses #3680 from sunchao/dev/chao/codex/celeborn-fast-memory-split-check-oss-main.\n\nAuthored-by: Chao Sun \u003cchao@openai.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "8d473c5af5a754a5b0329a805298ce9e8f0d27e7",
      "tree": "9e99c3d43bb897777dd9e364020dddd682aff94b",
      "parents": [
        "fc087567442427cb1b3b161d65a7d3eb7080c345"
      ],
      "author": {
        "name": "Kartikay Bhutani",
        "email": "kbhutani0001@gmail.com",
        "time": "Fri May 08 15:42:51 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Fri May 08 15:42:51 2026 +0800"
      },
      "message": "[CELEBORN-2318] Miss increment to WRITE_DATA_HARD_SPLIT_COUNT on returning HARD_SPLIT in handlePushData\n\n### What changes were proposed in this pull request?\nMissing increment to `WRITE_DATA_HARD_SPLIT_COUNT` on returning HARD_SPLIT\n\n### Why are the changes needed?\n- The post-restart detection branch in `handlePushData` (Case2: shuffleKey in storageManager but not in shuffleMapperAttempts) returns HARD_SPLIT without incrementing `WRITE_DATA_HARD_SPLIT_COUNT`\n- The sibling Case1 branch (line 398) and all other HARD_SPLIT return paths already increment it\n- This makes Case2 invisible to monitoring during rolling restarts\n\n### Does this PR resolve a correctness bug?\nNo\n\n### Does this PR introduce _any_ user-facing change?\nNo\n\n### How was this patch tested?\nExisting UTs\n\nCloses #3676 from kaybhutani/kartikay/missing-hard-split-metric.\n\nAuthored-by: Kartikay Bhutani \u003ckbhutani0001@gmail.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "fc087567442427cb1b3b161d65a7d3eb7080c345",
      "tree": "4d19d6137a32b397790c76cfe251c0053b3e2175",
      "parents": [
        "69df893b4133d8594680a78396cc21171fe29a14"
      ],
      "author": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Fri May 08 15:41:12 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Fri May 08 15:41:12 2026 +0800"
      },
      "message": "[CELEBORN-2320] Grafana dashboard linter should build from a checkout instead\n\n### What changes were proposed in this pull request?\n\nGrafana dashboard linter should build from a checkout instead.\n\nBackport https://github.com/grafana/dashboard-linter/pull/252.\n\n### Why are the changes needed?\n\nThis PR introduces [GoReleaser](https://github.com/goreleaser/goreleaser-action) to help create GitHub releases and compiled binaries.\n\nThe new release workflow will trigger on git tags starting with v.\n\nIn particular, providing pre-built release binaries addresses the following problem currently seen in main:\n\n```\ngo: downloading github.com/grafana/dashboard-linter v0.1.0\ngo: github.com/grafana/dashboard-linterlatest (in github.com/grafana/dashboard-linterv0.1.0):\n\tThe go.mod file for the module providing named packages contains one or\n\tmore replace directives. It must not contain directives that would cause\n\tit to be interpreted differently than if it were the main module.\n```\n\n`go install github.com/grafana/dashboard-linter\u003cversion\u003e` does not currently work because `go.mod` contains a `replace` directive — Go refuses to install a module with replaces. Build from a checkout instead:\n\n```\n$ git clone https://github.com/grafana/dashboard-linter.git\n$ cd dashboard-linter\n$ go build -o dashboard-linter .\n$ ./dashboard-linter lint dashboard.json\n```\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\n`Grafana Dashboard CI / lint (pull_request)`.\n\nCloses #3679 from SteNicholas/CELEBORN-2320.\n\nAuthored-by: SteNicholas \u003cprogramgeek@163.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "69df893b4133d8594680a78396cc21171fe29a14",
      "tree": "36b50e89756ac8b7609e0ffcb55afe2c39d9ff88",
      "parents": [
        "59fd7a8402364a1b82a8480dc33e7a5722ec6d2f"
      ],
      "author": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed Apr 22 14:19:35 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed Apr 22 14:19:35 2026 +0800"
      },
      "message": "[CELEBORN-2258][FOLLOWUP] Use IoHandlerFactories for EventLoopGroups to replace deprecated transport-specific event loop groups\n\n### What changes were proposed in this pull request?\n\nUse IoHandlerFactories for EventLoopGroups to replace deprecated transport-specific event loop groups.\n\nBackport: https://github.com/apache/spark/pull/52719.\n\n### Why are the changes needed?\n\nNetty 4.2 introduces some new APIs, and deprecates some old APIs. As part of your migration to Netty 4.2, we encourage you to look through your code base for opportunities to clean up any use of deprecated APIs.\n\n- **IoHandlerFactories for EventLoopGroups**\n\nAll transport-specific event loop groups, such as `NioEventLoopGroup`, have been deprecated. Integrators should now instead pass a transport-specific `IoHandlerFactory` to a `MultiThreadedEventLoopGroup` constructor.\n\nTherefore, Netty 4.2 upgrade could follow the best practices from https://netty.io/wiki/netty-4.2-migration-guide.html#new-best-practices.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nCI.\n\nCloses #3669 from SteNicholas/CELEBORN-2258.\n\nAuthored-by: SteNicholas \u003cprogramgeek@163.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "59fd7a8402364a1b82a8480dc33e7a5722ec6d2f",
      "tree": "ded487669ea9ff9de82d9b10512a6a5c3f1fc6f5",
      "parents": [
        "a56f69ae0abdf4e375b442aebfc6843fc8520bc9"
      ],
      "author": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed Apr 22 10:44:24 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed Apr 22 10:44:24 2026 +0800"
      },
      "message": "[CELEBORN-2309] Introduce JavaDeserializerFilter to support deserialization filter\n\n### What changes were proposed in this pull request?\n\nIntroduce `JavaDeserializerFilter` — an allowlist-only deserialization filter that prevents CWE-502 (Deserialization of Untrusted Data) attacks on Celeborn\u0027s internal RPC channel.\n\nKey design points:\n- **Dual-layer defense:** On JDK 9+, both a `resolveClass`-based class allowlist and JVM-level `ObjectInputFilter` (resource limits: maxdepth, maxarray, maxrefs, maxbytes) are enforced. On JDK 8, only the `resolveClass` allowlist is active.\n- **Reflection-based JDK compatibility:** Uses reflection to access `java.io.ObjectInputFilter` APIs, gracefully degrading on JDK 8 where the API does not exist.\n- **Minimal overhead:** Filter pattern and logging proxy are built once at construction time; hot-path `isClassAllowed` is a simple `String.startsWith` loop over a `String[]` — no streams, no allocation.\n- **Configurable via CelebornConf:** Enabled by default with sensible defaults; operators can customize allowed packages and resource limits without code changes.\n\nDefault allowed package prefixes: `java.`, `scala.`, `org.apache.celeborn.`, `com.google.protobuf.`, `[` (arrays).\n\n### Why are the changes needed?\n\nCeleborn\u0027s internal RPC between Master, Workers, and clients uses Java serialization (`JavaSerializer`). Without a deserialization filter, an attacker with network access to RPC ports can craft a malicious serialized payload containing gadget-chain classes (e.g., from commons-collections, Spring, etc.) to achieve Remote Code Execution.\n\nThis is a well-known attack vector (CWE-502 / OWASP A8:2017). 
Adding an allowlist-only filter ensures that only classes from trusted packages can be deserialized, blocking arbitrary gadget-chain exploitation regardless of which libraries are on the classpath.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce any user-facing change?\n\nNo. The filter is transparent to existing functionality — all legitimate Celeborn RPC classes are under `org.apache.celeborn.` which is allowed by default. Operators gain new configuration knobs if they need to extend the allowlist:\n\n| Config Key | Default |\n|-----------|---------|\n| `celeborn.serializer.deserialization.filter.enabled` | `true` |\n| `celeborn.serializer.deserialization.filter.allowedPackages` | `java.,jdk.,sun.,scala.,org.apache.celeborn.,com.google.protobuf.,[` |\n| `celeborn.serializer.deserialization.filter.maxDepth` | `100` |\n| `celeborn.serializer.deserialization.filter.maxArrayLength` | `10000` |\n| `celeborn.serializer.deserialization.filter.maxReferences` | `100000` |\n| `celeborn.serializer.deserialization.filter.maxStreamBytes` | `100000000` |\n\n### How was this patch tested?\n\n- `JavaDeserializerFilterSuiteJ`: unit tests covering default allowlist, custom allowlist, valid deserialization round-trip, and rejection of blocked classes via crafted serialization payload.\n- Manual integration testing on JDK 8 (graceful degradation) and JDK 11/17 (full ObjectInputFilter enforcement).\n\nCloses #3664 from SteNicholas/CELEBORN-2309.\n\nAuthored-by: SteNicholas \u003cprogramgeek@163.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "a56f69ae0abdf4e375b442aebfc6843fc8520bc9",
      "tree": "9b677ce255803c2d8976c913fd2b86247576d966",
      "parents": [
        "95419e14a0d1986a4582a6364057193dce296992"
      ],
      "author": {
        "name": "Shuang",
        "email": "lvshuang.xjs@alibaba-inc.com",
        "time": "Thu Apr 16 13:52:22 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Thu Apr 16 13:52:22 2026 +0800"
      },
      "message": "[MINOR] Update DingTalk Contact Info\n\n### What changes were proposed in this pull request?\n\nUpdate DingTalk contact information.\n\n### Why are the changes needed?\n\nDingTalk contact information has been expired.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nNo.\n\nCloses #3665 from RexXiong/UPDATE_DINGTALK_INFO.\n\nAuthored-by: Shuang \u003clvshuang.xjs@alibaba-inc.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "95419e14a0d1986a4582a6364057193dce296992",
      "tree": "50733b57b8b1f97dc32ec2fcde5d606baac8bcc6",
      "parents": [
        "149f3b98b8500627c2fd6432d5fb0cabf314fd9b"
      ],
      "author": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed Apr 15 19:10:11 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed Apr 15 19:10:11 2026 +0800"
      },
      "message": "[CELEBORN-2301] MessageEncoder enables zero-copy sendfile for FileRegion in Netty native transports\n\n### What changes were proposed in this pull request?\n\n`MessageEncoder` enables zero-copy sendfile for `FileRegion` in Netty native transports, which emits the header `ByteBuf` and `FileRegion` as **separate objects** in the outbound message list when the body is a `FileRegion` backed by `FileSegmentManagedBuffer`, instead of wrapping them together in a `MessageWithHeader`.\n\nPreviously, all messages with a body were unconditionally wrapped in `MessageWithHeader`. This caused native transports (EPOLL, KQUEUE) to fall into a generic `FileRegion.transferTo()` fallback path that copies data through user-space, bypassing the optimized `sendfile()` / `splice()` zero-copy path that Netty\u0027s native transports provide.\n\nThe split is only applied when the `ManagedBuffer` is a `FileSegmentManagedBuffer`, whose `release()` is a no-op, making it safe to emit the `FileRegion` independently of write lifecycle management. Other `ManagedBuffer` types (e.g., `BlockManagerManagedBuffer`) still use the `MessageWithHeader` wrapper because they perform resource cleanup in `release()` that must be tied to `MessageWithHeader.deallocate()`.\n\nBackport: https://github.com/apache/spark/pull/55087.\n\n### Why are the changes needed?\n\nWhen using native transports (AUTO/EPOLL on Linux), file-backed shuffle fetch performance was severely degraded compared to NIO mode. 
The root cause lies in how Netty\u0027s native transports dispatch `FileRegion` writes.\n\nIn `AbstractEpollStreamChannel.doWriteSingle()` (and the analogous KQueue path), Netty uses an `instanceof` check to choose between two write strategies:\n\nhttps://github.com/netty/netty/blob/eeb5674526f0b49a142580686a5a9a7147ddadec/transport-classes-epoll/src/main/java/io/netty/channel/epoll/AbstractEpollStreamChannel.java#L474-L493\n\n```java\n} else if (msg instanceof DefaultFileRegion) {\n    return writeDefaultFileRegion(in, (DefaultFileRegion) msg);  // → socket.sendFile() (zero-copy)\n} else if (msg instanceof FileRegion) {\n    return writeFileRegion(in, (FileRegion) msg);                // → region.transferTo() (user-space copy)\n}\n```\n\n- **`writeDefaultFileRegion()`** calls `socket.sendFile()`, which maps directly to the Linux `sendfile()` syscall — a true zero-copy path where data is transferred from the file page cache to the socket buffer entirely within the kernel, with no user-space copy.\n\nhttps://github.com/netty/netty/blob/eeb5674526f0b49a142580686a5a9a7147ddadec/transport-classes-epoll/src/main/java/io/netty/channel/epoll/AbstractEpollStreamChannel.java#L367-L386\n\n```java\n    private int writeDefaultFileRegion(ChannelOutboundBuffer in, DefaultFileRegion region) throws Exception {\n        final long offset \u003d region.transferred();\n        final long regionCount \u003d region.count();\n        if (offset \u003e\u003d regionCount) {\n            in.remove();\n            return 0;\n        }\n\n        final long flushedAmount \u003d socket.sendFile(region, region.position(), offset, regionCount - offset);\n        if (flushedAmount \u003e 0) {\n            in.progress(flushedAmount);\n            if (region.transferred() \u003e\u003d regionCount) {\n                in.remove();\n            }\n            return 1;\n        } else if (flushedAmount \u003d\u003d 0) {\n            validateFileRegion(region, offset);\n        }\n        return 
WRITE_STATUS_SNDBUF_FULL;\n    }\n```\n\n- **`writeFileRegion()`** falls back to `region.transferTo(WritableByteChannel)`, which writes data through a `SocketWritableByteChannel` wrapper — effectively a user-space copy path.\n\nhttps://github.com/netty/netty/blob/eeb5674526f0b49a142580686a5a9a7147ddadec/transport-classes-epoll/src/main/java/io/netty/channel/epoll/AbstractEpollStreamChannel.java#L402-L420\n\n```java\nprivate int writeFileRegion(ChannelOutboundBuffer in, FileRegion region) throws Exception {\n        if (region.transferred() \u003e\u003d region.count()) {\n            in.remove();\n            return 0;\n        }\n\n        if (byteChannel \u003d\u003d null) {\n            byteChannel \u003d new EpollSocketWritableByteChannel();\n        }\n        final long flushedAmount \u003d region.transferTo(byteChannel, region.transferred());\n        if (flushedAmount \u003e 0) {\n            in.progress(flushedAmount);\n            if (region.transferred() \u003e\u003d region.count()) {\n                in.remove();\n            }\n            return 1;\n        }\n        return WRITE_STATUS_SNDBUF_FULL;\n    }\n```\n\nSpark\u0027s `MessageWithHeader extends AbstractFileRegion` (not `DefaultFileRegion`). When `MessageEncoder` wraps a `DefaultFileRegion` body inside `MessageWithHeader`, the resulting object is a generic `FileRegion` from Netty\u0027s perspective. 
This means Netty dispatches it to the `writeFileRegion()` fallback, which calls `MessageWithHeader.transferTo()`:\n\n```java\n// MessageWithHeader.java, line 121\nif (body instanceof FileRegion fileRegion) {\n    writtenBody \u003d fileRegion.transferTo(target, totalBytesTransferred - headerLength);\n}\n```\n\nHere, even though the inner body is a `DefaultFileRegion`, its `transferTo()` is invoked with a `WritableByteChannel` (not a file descriptor), so the data is read from the file into a user-space buffer and then written to the socket — **the zero-copy opportunity is lost**.\n\nBy emitting the `DefaultFileRegion` directly into Netty\u0027s outbound buffer (instead of wrapping it in `MessageWithHeader`), Netty\u0027s native transport recognizes it via `instanceof DefaultFileRegion` and routes it to `socket.sendFile()`, restoring the zero-copy `sendfile()` path.\n\n**Benchmark results (File-Backed Shuffle Fetch) show dramatic improvement:**\n\n| Scenario | Before (ms) | After (ms) | Improvement |\n|---|---|---|---|\n| EPOLL, sequential fetch (JDK8) | 524 | 134 | **~3.9x faster** |\n| EPOLL, parallel fetch (JDK8) | 191 | 55 | **~3.4x faster** |\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo. This is an internal optimization to the Netty transport layer. Users benefit from improved shuffle fetch performance when using native transports (the default on Linux) without any configuration changes.\n\n### How was this patch tested?\n\n- Re-ran `NettyTransportBenchmark` with JDK8 to confirm the performance improvement. Updated benchmark result files accordingly.\n\nCloses #3649 from SteNicholas/CELEBORN-2301.\n\nLead-authored-by: SteNicholas \u003cprogramgeek@163.com\u003e\nCo-authored-by: Cheng Pan \u003cchengpan@apache.org\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "149f3b98b8500627c2fd6432d5fb0cabf314fd9b",
      "tree": "276c06da6a28836edc202846d30e9ffb497bd48b",
      "parents": [
        "c456df3ec6a4611895dcc7dfe4167ef1e261560c"
      ],
      "author": {
        "name": "Sanskar Modi",
        "email": "sanskarmodi97@gmail.com",
        "time": "Wed Apr 15 15:07:35 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed Apr 15 15:07:35 2026 +0800"
      },
      "message": "[CELEBORN-1577][BUG] Quota cancel shuffle should use app shuffle id\n\n### What changes were proposed in this pull request?\n\n- Added a new mapping for celebornShuffleId -\u003e appShuffleId\n- cancelAllActiveStages should passing appShuffleId not celebornShuffleId\n\n### Why are the changes needed?\n\n`shuffleAllocatedWorkers` worker contains celebornShuffleId, we need to use `appShuffleId` because DAGScheduler only understand app shuffle id.\n\n### Does this PR resolve a correctness bug?\n\nNo\n\n### Does this PR introduce _any_ user-facing change?\n\nNo\n\n### How was this patch tested?\n\nNA\n\nCloses #3662 from s0nskar/fix_quota_shuffle_id.\n\nAuthored-by: Sanskar Modi \u003csanskarmodi97@gmail.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "c456df3ec6a4611895dcc7dfe4167ef1e261560c",
      "tree": "222b50934eec1eb54bd9e91a7fdf21dd03abfc02",
      "parents": [
        "913d027efd1e56d080ec8080818e05ccf8eaa025"
      ],
      "author": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon Apr 13 10:26:56 2026 +0700"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon Apr 13 10:26:56 2026 +0700"
      },
      "message": "[CELEBORN-2298] Introduce NettyTransportBenchmark for Netty transport layer performance evaluation\n\n### What changes were proposed in this pull request?\n\nIntroduce `NettyTransportBenchmark` for Netty transport layer performance evaluation.\n\nAll suites measure performance through the actual Celeborn transport pipeline\n(`TransportServer` + `TransportClientFactory` + `TransportContext`).\n\nSuite overview:\n1. RPC Latency            - server-client RPC overhead at different payload sizes\n2. Concurrent Throughput  - multi-client pressure on the transport layer\n3. IOMode Comparison      - NIO vs native transport (Automatically selects EPOLL/KQUEUE)\n4. Server Thread Scaling  - validates MAX_DEFAULT_NETTY_THREADS\u003d8 cap\n5. Multi-Connection       - numConnectionsPerPeer\u003d1 vs 2 vs 4\n6. Async Write Pressure   - fire-and-forget RPCs to saturate the write path\n7. Large Block Transfer   - shuffle-like 16MB block transfers (in-memory payload)\n8. File-Backed Shuffle    - ChunkFetch from disk, NIO vs native transport (EPOLL sendfile bypass detection)\n\nBackport: https://github.com/apache/spark/pull/55061.\n\n### Why are the changes needed?\n\nNetty is a crucial third-party component for Celeborn. Introduce a micro-benchmark test facilitates:\n\n- Verify performance during subsequent Netty upgrades;\n- Validate performance after changes to relevant code in Celeborn.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nGenerated `NettyTransportBenchmark-results.txt`.\n\nCloses #3647 from SteNicholas/CELEBORN-2298.\n\nLead-authored-by: SteNicholas \u003cprogramgeek@163.com\u003e\nCo-authored-by: Cheng Pan \u003cchengpan@apache.org\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "913d027efd1e56d080ec8080818e05ccf8eaa025",
      "tree": "9a4cb97d0b6087452de4264ca509e447f806deca",
      "parents": [
        "163bcb36edc11d1b999827a4e06bfcdd3bc7b3ea"
      ],
      "author": {
        "name": "zhengtao",
        "email": "shuaizhentao.szt@alibaba-inc.com",
        "time": "Sat Apr 11 11:44:19 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Sat Apr 11 11:44:19 2026 +0800"
      },
      "message": "[CELEBORN-2287] Split mode should be HARD_SPLIT when disk is full\n\n### What changes were proposed in this pull request?\n\nChange the split mode to `HARD_SPLIT` when the disk is full.\n\n### Why are the changes needed?\n\nWhen the disk is already full, continuing to write data in `SOFT_SPLIT` mode may cause the reserved space to be filled up as well.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nManual test and UT.\n\nCloses #3653 from zaynt4606/clb2287.\n\nAuthored-by: zhengtao \u003cshuaizhentao.szt@alibaba-inc.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "163bcb36edc11d1b999827a4e06bfcdd3bc7b3ea",
      "tree": "61a031efca759baf3b9a72d7e137dd8b6ff3c7e2",
      "parents": [
        "a30166df08f884869ef3c0ef03829cd21bed8814"
      ],
      "author": {
        "name": "sychen",
        "email": "sychen@ctrip.com",
        "time": "Fri Apr 10 14:41:33 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Fri Apr 10 14:41:33 2026 +0800"
      },
      "message": "[CELEBORN-2297] Update workflow to manually install Helm and chart-testing\n\n### What changes were proposed in this pull request?\n\n### Why are the changes needed?\n```\nThe actions azure/setup-helmv4.2.0, docker/setup-buildx-actionv1, and docker/build-push-actionv2 are not allowed in apache/celeborn\n```\nhttps://github.com/apache/celeborn/actions/runs/23730849688\n\n```yml\n      - name: Setup chart-testing\n        uses: ./.github/actions/chart-testing-action\n```\n\n```\nThe action sigstore/cosign-installer11086d25041f77fe8fe7b9ea4e48e3b9192b8f19 is not allowed in apache/celebor\n```\n### Does this PR resolve a correctness bug?\nNo\n\n### Does this PR introduce _any_ user-facing change?\nNo\n\n### How was this patch tested?\nGHA\nhttps://github.com/apache/celeborn/actions/runs/23884713736/job/69645124360?pr\u003d3639\n\nCloses #3639 from cxzl25/fix_it_test.\n\nAuthored-by: sychen \u003csychen@ctrip.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "a30166df08f884869ef3c0ef03829cd21bed8814",
      "tree": "973a7f4d526ed2802804873e13bf46f90542f60e",
      "parents": [
        "c688c76d3c8de471afbf8c5fd1857b0a5926a721"
      ],
      "author": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Fri Apr 10 14:39:51 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Fri Apr 10 14:39:51 2026 +0800"
      },
      "message": "[CELEBORN-2305] Bump Ratis version from 3.2.1 to 3.2.2\n\n### What changes were proposed in this pull request?\n\nBump Ratis version from 3.2.1 to 3.2.2.\n\n### Why are the changes needed?\n\nRatis has released v3.2.2; see the release notes for [3.2.2](https://ratis.apache.org/post/3.2.2.html). The 3.2.2 version is a maintenance release with multiple improvements and bugfixes.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nCI.\n\nCloses #3658 from SteNicholas/CELEBORN-2305.\n\nAuthored-by: SteNicholas \u003cprogramgeek@163.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "c688c76d3c8de471afbf8c5fd1857b0a5926a721",
      "tree": "6f36aa996d8ad62c0644e384f0492f4b05c96a04",
      "parents": [
        "c246031889495c45c3e96cf682f2378382411e0e"
      ],
      "author": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Thu Apr 09 20:35:35 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Thu Apr 09 20:35:35 2026 +0800"
      },
      "message": "[CELEBORN-2063][FOLLOWUP] Fix timeout unit for parallel creation of partition writer\n\n### What changes were proposed in this pull request?\n\nFix timeout unit for parallel creation of partition writer in `Utils#tryFuturesWithTimeout`.\n\nFollow up #3387, #3656.\n\n### Why are the changes needed?\n\n`Utils#tryFuturesWithTimeout` uses wrong timeout unit which does not match the config option `celeborn.worker.writer.create.parallel.timeout` as follows:\n\n```\nval WORKER_WRITER_CREATE_PARALLEL_TIMEOUT: ConfigEntry[Long] \u003d\n    buildConf(\"celeborn.worker.writer.create.parallel.timeout\")\n      .categories(\"worker\")\n      .version(\"0.6.3\")\n      .doc(\"Timeout for a worker to create a file writer in parallel.\")\n      .timeConf(TimeUnit.MILLISECONDS)\n      .createWithDefaultString(\"120s\")\n```\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nCI.\n\nCloses #3657 from SteNicholas/CELEBORN-2063.\n\nAuthored-by: SteNicholas \u003cprogramgeek@163.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "c246031889495c45c3e96cf682f2378382411e0e",
      "tree": "e424bf054b33dd198e3b8dfb00c8835e387d792e",
      "parents": [
        "84830d9fcfe1067c050774d8f21a121b4b22911a"
      ],
      "author": {
        "name": "Xianming Lei",
        "email": "xianming.lei@shopee.com",
        "time": "Thu Apr 09 14:49:01 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Thu Apr 09 14:49:01 2026 +0800"
      },
      "message": "[CELEBORN-2294][FOLLOWUP] Fix flaky test SparkUtilsSuite\n\n### What changes were proposed in this pull request?\nRemove racy assertions in the \"check if fetch failure task another attempt is running or successful\" test in SparkUtilsSuite.\n\n### Why are the changes needed?\n\nAfter CELEBORN-2294 added a zombie TaskSetManager check in shouldReportShuffleFetchFailure, the test became flaky due to a race condition. The test calls shouldReportShuffleFetchFailure a second time from the test thread, but by that point the FetchFailed has already been processed by Spark\u0027s DAGScheduler — **the TaskSetManager is either marked as zombie** or the task has been removed from taskIdToTaskSetManager. This causes the second call to return false, failing the assertion.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nThe affected UT: SparkUtilsSuite - check if fetch failure task another attempt is running or successful.\n\nCloses #3655 from leixm/FOLLOW-CELEBORN-2294.\n\nAuthored-by: Xianming Lei \u003cxianming.lei@shopee.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "84830d9fcfe1067c050774d8f21a121b4b22911a",
      "tree": "dc44136c990291c6f63c631bd1d609e8aac4f62a",
      "parents": [
        "7f1bac3443f191d2bc5dcebebb88cd1744b53cd7"
      ],
      "author": {
        "name": "sychen",
        "email": "sychen@ctrip.com",
        "time": "Thu Apr 09 08:24:21 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Thu Apr 09 08:24:21 2026 +0800"
      },
      "message": "[CELEBORN-2304] Fix timeout unit mismatch in disk monitor check\n\n### What changes were proposed in this pull request?\nChanged tryWithTimeoutAndCallback and tryFutureWithTimeoutAndCallback in Utils.scala to accept timeout in milliseconds instead of seconds.\n\n### Why are the changes needed?\nWORKER_DEVICE_STATUS_CHECK_TIMEOUT is configured in milliseconds (e.g. 30s → 30000), but was passed directly to Future.get(..., TimeUnit.SECONDS)\n\nhttps://github.com/apache/celeborn/blob/7f1bac3443f191d2bc5dcebebb88cd1744b53cd7/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala#L4100-L4107\n\nhttps://github.com/apache/celeborn/blob/7f1bac3443f191d2bc5dcebebb88cd1744b53cd7/worker/src/main/scala/org/apache/celeborn/service/deploy/worker/storage/DeviceMonitor.scala#L287-L291\n\n### Does this PR resolve a correctness bug?\nNo\n\n### Does this PR introduce _any_ user-facing change?\nNo\n\n### How was this patch tested?\nGHA\n\nCloses #3656 from cxzl25/CELEBORN-2304.\n\nAuthored-by: sychen \u003csychen@ctrip.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "7f1bac3443f191d2bc5dcebebb88cd1744b53cd7",
      "tree": "ee96a224a2ce2d92e94077cdb95006889bf12149",
      "parents": [
        "234ff2d705bd5a6c4a6603d3976af8a3615bcbf9"
      ],
      "author": {
        "name": "sychen",
        "email": "sychen@ctrip.com",
        "time": "Wed Apr 08 11:06:35 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed Apr 08 11:06:35 2026 +0800"
      },
      "message": "[CELEBORN-2302] Fix NPE in MemoryManager.close() when readBufferDispatcher is not initialized\n\n### What changes were proposed in this pull request?\nAdd a null check for readBufferDispatcher before calling close() in MemoryManager.close().\n\n### Why are the changes needed?\nreadBufferDispatcher is only initialized when readBufferThreshold \u003e 0.\n\n### Does this PR resolve a correctness bug?\nNo\n\n### Does this PR introduce _any_ user-facing change?\nNo\n\n### How was this patch tested?\nGHA\n\nCloses #3654 from cxzl25/CELEBORN-2302.\n\nAuthored-by: sychen \u003csychen@ctrip.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "234ff2d705bd5a6c4a6603d3976af8a3615bcbf9",
      "tree": "bd828e62cf9201f8e95a60c352375cb4c1d7d2ac",
      "parents": [
        "ca8533c8390bfcdcb418144e9fc95c3561fe641a"
      ],
      "author": {
        "name": "afterincomparableyum",
        "email": "224495379+afterincomparableyum@users.noreply.github.com",
        "time": "Wed Apr 08 11:04:32 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed Apr 08 11:04:32 2026 +0800"
      },
      "message": "[CELEBORN-2219][CIP-14] Support PushMergedData in CppClient\n\n### What changes were proposed in this pull request?\n\nImplement PushMergedData functionality in the C++ client, enabling batch merging and pushing of shuffle data grouped by worker address.\n\nKey changes:\n  - Add mergeData() and pushMergedData() to ShuffleClient, which accumulate per-partition data batches and push them as merged payloads when the buffer threshold is exceeded or at mapper end.\n  - Introduce DataBatches class to manage batch accumulation with thread-safe add/take operations and size-bounded requireBatches().\n  - Add PushMergedDataCallback to handle success responses (split handling, congestion control, MAP_ENDED) and failure responses with revive-based retry via submitRetryPushMergedData().\n  - Add PushMergedData network message type with encoding for partitionUniqueIds and batchOffsets arrays.\n  - Extend Encoders with encode/decode support for vector\u003cstring\u003e and vector\u003cint32_t\u003e.\n  - Add pushMergedDataAsync() to TransportClient.\n  - Add unit tests for DataBatches, PushMergedData message encoding, and array encoders.\n\n### Why are the changes needed?\n\nThis is needed to extend the functionality of the C++ client, and Bolt depends on it; see https://github.com/bytedance/bolt/issues/370\n\n### Does this PR resolve a correctness bug?\n\nNo\n\n### Does this PR introduce _any_ user-facing change?\n\nYes, because it is new functionality in the C++ client\n\n### How was this patch tested?\n\nTested through running unit tests and compiling locally.\n\nCloses #3611 from afterincomparableyum/cpp-client/celeborn-2219.\n\nAuthored-by: afterincomparableyum \u003c224495379+afterincomparableyum@users.noreply.github.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "ca8533c8390bfcdcb418144e9fc95c3561fe641a",
      "tree": "05ca0cca878a6e1c7208e0dcee9be35ab8c365c5",
      "parents": [
        "37aed7ea5421e0de9bfdfbd452746103a579232e"
      ],
      "author": {
        "name": "sychen",
        "email": "sychen@ctrip.com",
        "time": "Wed Apr 08 10:55:07 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed Apr 08 10:55:07 2026 +0800"
      },
      "message": "[CELEBORN-2300] Change the default value of celeborn.port.maxRetries from 1 to 16\n\n### What changes were proposed in this pull request?\n\n### Why are the changes needed?\n\nAlign with spark.port.maxRetries, default to 16 retries.\n\n```java\nERROR Executor: Exception in task 1153.0 in stage 11763.0 (TID 516674)\njava.lang.RuntimeException: java.net.BindException: Address already in use: Service \u0027ShuffleClient\u0027 failed after 1 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service \u0027ShuffleClient\u0027 to the correct binding address.\n\tat org.apache.spark.shuffle.celeborn.SparkShuffleManager.getWriter(SparkShuffleManager.java:310)\n```\n\n### Does this PR resolve a correctness bug?\nNo\n\n### Does this PR introduce _any_ user-facing change?\nYes\n\n### How was this patch tested?\nGHA\n\nCloses #3648 from cxzl25/CELEBORN-2300.\n\nAuthored-by: sychen \u003csychen@ctrip.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "37aed7ea5421e0de9bfdfbd452746103a579232e",
      "tree": "fb701374002a3929cc7008693d851ed9e2d8d546",
      "parents": [
        "37a27bc4cb3c993289e5b94439db224572cd3e66"
      ],
      "author": {
        "name": "Xianming Lei",
        "email": "jerrylei@apache.org",
        "time": "Fri Apr 03 18:00:01 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Fri Apr 03 18:00:01 2026 +0800"
      },
      "message": "[CELEBORN-2295] CommitHandler should support retry interval\n\n### What changes were proposed in this pull request?\n\n`CommitHandler` should support retry interval for retry of committing file.\n\n### Why are the changes needed?\n\nWhen commitFiles RPC fails, the current implementation retries immediately without any backoff. If the worker is experiencing transient network issues, immediate retries are likely to fail again. Adding a configurable retry interval (`celeborn.client.requestCommitFiles.retryInterval`, default 10s) gives the worker time to recover before the next attempt, significantly improving the success rate of retries. A dedicated `ScheduledExecutorService` is used to avoid blocking threads in the shared RPC pool during the wait.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nCI.\n\nCloses #3642 from leixm/CELEBORN-2295.\n\nLead-authored-by: Xianming Lei \u003cjerrylei@apache.org\u003e\nCo-authored-by: Xianming Lei \u003cxianming.lei@shopee.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "37a27bc4cb3c993289e5b94439db224572cd3e66",
      "tree": "727807a0fb7165ebcdf3eb56dd1873bac849ba37",
      "parents": [
        "097f1df172e7ed8f2d32c3c5a8c04b22a595fe25"
      ],
      "author": {
        "name": "sychen",
        "email": "sychen@ctrip.com",
        "time": "Thu Apr 02 10:56:26 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Thu Apr 02 10:56:26 2026 +0800"
      },
      "message": "[CELEBORN-2296] Fix race condition in MemoryManager singleton initialization\n\n### What changes were proposed in this pull request?\nApply double-checked locking to MemoryManager.initialize() and add synchronized to MemoryManager.reset() to make the singleton lifecycle thread-safe.\n\n### Why are the changes needed?\n\n```java\n26/04/01 10:32:20,050 ERROR [worker 3 starter thread] WordCountTestWithAuthentication: create worker failed, detail:\njava.lang.NullPointerException\n        at org.apache.celeborn.service.deploy.worker.memory.ChannelsLimiter.\u003cinit\u003e(ChannelsLimiter.java:52)\n        at org.apache.celeborn.service.deploy.worker.Worker.\u003cinit\u003e(Worker.scala:240)\n        at org.apache.celeborn.service.deploy.MiniClusterFeature.createWorker(MiniClusterFeature.scala:172)\n        at org.apache.celeborn.service.deploy.MiniClusterFeature.createWorker$(MiniClusterFeature.scala:153)\n        at org.apache.celeborn.tests.flink.WordCountTestBase.createWorker(WordCountTest.scala:44)\n        at org.apache.celeborn.service.deploy.MiniClusterFeature.createWorker(MiniClusterFeature.scala:150)\n        at org.apache.celeborn.service.deploy.MiniClusterFeature.createWorker$(MiniClusterFeature.scala:149)\n        at org.apache.celeborn.tests.flink.WordCountTestBase.createWorker(WordCountTest.scala:44)\n        at org.apache.celeborn.service.deploy.MiniClusterFeature.$anonfun$setUpWorkers$2(MiniClusterFeature.scala:221)\n        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n        at org.apache.celeborn.service.deploy.MiniClusterFeature$RunnerWrap.$anonfun$run$1(MiniClusterFeature.scala:50)\n        at org.apache.celeborn.common.util.Utils$.tryLogNonFatalError(Utils.scala:234)\n        at org.apache.celeborn.service.deploy.MiniClusterFeature$RunnerWrap.run(MiniClusterFeature.scala:50)\n```\n\n### Does this PR resolve a correctness bug?\nNo\n\n### Does this PR introduce _any_ user-facing change?\nNo\n\n### How was this patch tested?\nGHA\n\nCloses #3643 from cxzl25/CELEBORN-2296.\n\nAuthored-by: sychen \u003csychen@ctrip.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "097f1df172e7ed8f2d32c3c5a8c04b22a595fe25",
      "tree": "f6dc3c19226cdf0132931530f905a7dcb0236f6b",
      "parents": [
        "c3adbca3cd02f5f34ba120f86bffbee71fe78e03"
      ],
      "author": {
        "name": "Xianming Lei",
        "email": "jerrylei@apache.org",
        "time": "Thu Apr 02 10:52:28 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Thu Apr 02 10:52:28 2026 +0800"
      },
      "message": "[CELEBORN-2294] The shuffle fetch failed report from the zombie stage should be ignored\n\n### What changes were proposed in this pull request?\nThe shuffle fetch failed report from the zombie stage should be ignored\n\n### Why are the changes needed?\nWithout this PR, if a stage attempt has already triggered FetchFailed, there will still be running tasks reporting fetch failed to LifeCycleManager, which will cause the current Stage Attempt to mistakenly trigger a stage rerun.\n\nSpark also ignores FetchFailed from the previous stage attempt, and Celeborn should keep the same logic.\n\u003cimg width\u003d\"1760\" height\u003d\"664\" alt\u003d\"image\" src\u003d\"https://github.com/user-attachments/assets/201cd849-475b-442f-bec1-0b8ef1048036\" /\u003e\n\n### Does this PR resolve a correctness bug?\nNo.\n\n### Does this PR introduce _any_ user-facing change?\nNo.\n\n### How was this patch tested?\nExisting UTs.\n\nCloses #3640 from leixm/main.\n\nAuthored-by: Xianming Lei \u003cjerrylei@apache.org\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "c3adbca3cd02f5f34ba120f86bffbee71fe78e03",
      "tree": "379b2a992c9f7abe7206a3f38ffccaa972aead0d",
      "parents": [
        "d3b75132daeaa0484ae2562fe2a9c9fadfa70c92"
      ],
      "author": {
        "name": "Kartikay Bhutani",
        "email": "kbhutani0001@gmail.com",
        "time": "Thu Apr 02 10:50:09 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Thu Apr 02 10:50:09 2026 +0800"
      },
      "message": "[CELEBORN-2293] Fix ConcurrentModificationException in WorkerStatusTracker.shuttingWorkers\n\n### What changes were proposed in this pull request?\n\nReplace `HashSet` with `ConcurrentHashMap.newKeySet()` for `shuttingWorkers` in `WorkerStatusTracker`.\n\n### Why are the changes needed?\n\nWhen multiple shuffles hit a shutting-down worker simultaneously, one thread iterates `shuttingWorkers` in `currentFailedWorkers()` (for logging) while another modifies it in `recordWorkerFailure()`.\n\n### Does this PR resolve a correctness bug?\n\nYes.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nAdded a unit test in `WorkerStatusTrackerSuite` and ran the same. Was able to reproduce only once (without the fix) out of multiple runs.\n\nCloses #3638 from kaybhutani/fix-concurrent-shuttingworkers.\n\nLead-authored-by: Kartikay Bhutani \u003ckbhutani0001@gmail.com\u003e\nCo-authored-by: kartikay \u003ckbhutani0001@gmail.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "d3b75132daeaa0484ae2562fe2a9c9fadfa70c92",
      "tree": "f3c10bf4c3e9baa4af5cde522c24052379c37a6f",
      "parents": [
        "42f1a08c1bcf0ba117d05d91a766d1214e5b881f"
      ],
      "author": {
        "name": "luogen.lg",
        "email": "luogen.lg@alibaba-inc.com",
        "time": "Wed Apr 01 15:48:42 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed Apr 01 15:48:42 2026 +0800"
      },
      "message": "[CELEBORN-2292] Fix ArithmeticException when PUSH_DATA_HAND_SHAKE fails before any data written\n\n### What changes were proposed in this pull request?\n\nNOTE: This is the same patch as #3637, applied to the main branch. Because some code has been refactored, the original patch cannot simply be cherry-picked.\n\nHandle the case where numSubpartitions is zero in MapPartitionDataReader.open(). When the partition is empty, treat it as a normal empty partition and notify consumers accordingly.\n\n### Why are the changes needed?\n\nWhen the first PUSH_DATA_HAND_SHAKE request fails (e.g., timeout), the client triggers revive with reason HARD_SPLIT. The manager adds the failed partition to partition locations, but numSubpartitions remains uninitialized (zero). Reading such a partition causes ArithmeticException: / by zero.\nSince this is caused by client-side behavior, we handle it on the worker side first for cross-version compatibility. The issue that the Flink shuffle client always revives with the fixed reason HARD_SPLIT can be addressed in later PRs.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce any user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nManually tested with a hacked version that throws an exception on the first handshake invocation. The test code is too hacky to include in this PR. Advice is welcome on how to add a proper unit test for this scenario without introducing too much complexity.\n\nCloses #3641 from pltbkd/CELEBORN-2292-on-main.\n\nAuthored-by: luogen.lg \u003cluogen.lg@alibaba-inc.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "42f1a08c1bcf0ba117d05d91a766d1214e5b881f",
      "tree": "a0a969901d16522734ca7c9e80e34e302a9f1166",
      "parents": [
        "235f07de49339d4cc2d5ac94deb55106a79ca7b4"
      ],
      "author": {
        "name": "Aravind Patnam",
        "email": "akpatnam25@gmail.com",
        "time": "Mon Mar 30 14:20:21 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon Mar 30 14:20:21 2026 +0800"
      },
      "message": "[CELEBORN-2284] Fix TLS Memory Leak\n\n### What changes were proposed in this pull request?\nWhile running jobs with TLS enabled, we encountered memory leaks which cause worker OOMs.\n```\n26/02/13 21:02:52,779 ERROR [push-server-9-9] ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it\u0027s garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.\nRecent access records:\nCreated at:\n\tio.netty.buffer.AbstractByteBufAllocator.compositeDirectBuffer(AbstractByteBufAllocator.java:224)\n\tio.netty.buffer.AbstractByteBufAllocator.compositeBuffer(AbstractByteBufAllocator.java:202)\n\torg.apache.celeborn.common.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:143)\n\torg.apache.celeborn.common.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:66)\n\tio.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)\n\tio.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)\n\tio.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)\n\tio.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475)\n\tio.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338)\n\tio.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387)\n\tio.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529)\n\tio.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468)\n\tio.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)\n\tio.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)\n\tio.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)\n\tio.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)\n\tio.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)\n\tio.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)\n\tio.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)\n\tio.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)\n\tio.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)\n\tio.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)\n\tio.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)\n\tio.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)\n\tio.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)\n\tio.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)\n\tio.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tio.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tjava.base/java.lang.Thread.run(Thread.java:840)\n\n```\n\nWhen a Celeborn worker receives a PushData or PushMergedData message, it replicates that frame to a secondary worker for fault tolerance. On an SSL-enabled cluster this replication goes through SslMessageEncoder.encode(). Here is the flow of what happens inside SslMessageEncoder.encode():\n\n- The encoder asks the message body for an SSL-friendly copy by calling convertToNettyForSsl(). For shuffle data, the body is a NettyManagedBuffer — data already loaded in off-heap memory. This call runs buf.duplicate().retain(), which creates a second reference to the same memory and increments the reference count from 1 to 2.\n\n- The encoder places this second reference inside a composite buffer and hands it to Netty for writing.\n\n- Netty writes the composite to the network, then releases it — decrementing the count from 2 to 1.\n\n- Nothing releases the original NettyManagedBuffer\u0027s hold on the data, so the count stays at 1 forever.\n\n- This results in every replicated PushData frame leaking a chunk of off-heap memory, eventually causing OOM and worker crash.\n\nThe fix for this issue is to release the original message body, so that the net reference count is preserved. The second reference — now living inside the composite buffer in out — keeps the memory alive while Netty writes it to the network. When Netty finishes and releases the composite, the count reaches 0 and the memory is freed cleanly.\n\nThis is exactly what the non-SSL MessageEncoder already does via MessageWithHeader.deallocate() — the SSL path simply needed to replicate that behavior explicitly.\n\n### Why are the changes needed?\nfix memory leak\n\n### Does this PR resolve a correctness bug?\n\n### Does this PR introduce _any_ user-facing change?\nno\n\n### How was this patch tested?\nalready internally in production and tested.\nAlso added unit tests\n\nCloses #3630 from akpatnam25/CELEBORN-2284.\n\nAuthored-by: Aravind Patnam \u003cakpatnam25@gmail.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "235f07de49339d4cc2d5ac94deb55106a79ca7b4",
      "tree": "5070577cb04f73b7a0887f4809a90f912104a227",
      "parents": [
        "6fc5565319c6622aef8637a5cf5fa348fc763745"
      ],
      "author": {
        "name": "sychen",
        "email": "sychen@ctrip.com",
        "time": "Mon Mar 30 10:37:43 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon Mar 30 10:37:43 2026 +0800"
      },
      "message": "[CELEBORN-2277] Replace synchronized in Flusher.getWorkerIndex with AtomicInteger\n\n### What changes were proposed in this pull request?\n\nReplace the synchronized block in getWorkerIndex with an AtomicInteger.updateAndGet call using a CAS-based atomic operation.\n\n### Why are the changes needed?\n\nThe synchronized keyword locks the entire object and causes thread contention under high concurrency. Using AtomicInteger reduces lock scope to a single variable and avoids blocking overhead for this lightweight index increment operation.\n\n### Does this PR resolve a correctness bug?\nNo.\n\n### Does this PR introduce _any_ user-facing change?\nNo.\n\n### How was this patch tested?\n\nGHA.\n\nCloses #3621 from cxzl25/CELEBORN-2277.\n\nLead-authored-by: sychen \u003csychen@ctrip.com\u003e\nCo-authored-by: cxzl25 \u003c3898450+cxzl25@users.noreply.github.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "6fc5565319c6622aef8637a5cf5fa348fc763745",
      "tree": "9f614a4b7cfcddecbe4e95e2687f0ce6eb1b65d2",
      "parents": [
        "3773c6568a3041bab2294e4a2e31ce6a42a05ab6"
      ],
      "author": {
        "name": "Kartikay Bhutani",
        "email": "kbhutani0001@gmail.com",
        "time": "Fri Mar 27 10:35:55 2026 +0800"
      },
      "committer": {
        "name": "子懿",
        "email": "ziyi.jxf@antgroup.com",
        "time": "Fri Mar 27 10:35:55 2026 +0800"
      },
      "message": "[CELEBORN-2291] Support fsync on commit to ensure shuffle data durability\n\n### What changes were proposed in this pull request?\n  Add a new configuration `celeborn.worker.commitFiles.fsync` (default `false`) that calls `FileChannel.force(false)` (fdatasync) before closing the channel in\n   `LocalTierWriter.closeStreams()`.\n\n  ### Why are the changes needed?\n\n  Without this, committed shuffle data can sit in the OS page cache before the kernel flushes it to disk. A hard crash in that window loses data even though Celeborn considers it committed. This option lets operators opt into stronger durability guarantees.\n\n  ### Does this PR resolve a correctness bug?\n\n  No. It adds an optional durability enhancement.\n\n  ### Does this PR introduce _any_ user-facing change?\n\n  Yes. New configuration key `celeborn.worker.commitFiles.fsync` (boolean, default `false`).\n\n  ### How was this patch tested?\n\n  Existing unit tests. Configuration verified via `ConfigurationSuite` and for LocalTierWriter added a new test with fsync enabled and ran `TierWriterSuite`.\n\nAdditional context: [slack](https://apachecelebor-kw08030.slack.com/archives/C04B1FYS6SY/p1774259245973229)\n\nCloses #3635 from kaybhutani/kartikay/fsync-on-commit.\n\nAuthored-by: Kartikay Bhutani \u003ckbhutani0001@gmail.com\u003e\nSigned-off-by: 子懿 \u003cziyi.jxf@antgroup.com\u003e\n"
    },
    {
      "commit": "3773c6568a3041bab2294e4a2e31ce6a42a05ab6",
      "tree": "e026d2ebb409123e181430dafaa2507b7548a211",
      "parents": [
        "28a0733bb2441f98531f531b26800c0d3ea06e99"
      ],
      "author": {
        "name": "sychen",
        "email": "sychen@ctrip.com",
        "time": "Mon Mar 23 14:02:54 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon Mar 23 14:02:54 2026 +0800"
      },
      "message": "[CELEBORN-2285] Bump maven 3.9.14\n\n### What changes were proposed in this pull request?\n\n### Why are the changes needed?\nhttps://maven.apache.org/docs/3.9.14/release-notes.html\n\n### Does this PR resolve a correctness bug?\nNo\n\n### Does this PR introduce _any_ user-facing change?\nNo\n\n### How was this patch tested?\nGHA\n\nCloses #3634 from cxzl25/CELEBORN-2285.\n\nAuthored-by: sychen \u003csychen@ctrip.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "28a0733bb2441f98531f531b26800c0d3ea06e99",
      "tree": "397acb36495a4f97d5b4677df79c952f49b7f82d",
      "parents": [
        "5fc0f199a4080c407bcb2bd5225b6e1edafaadf8"
      ],
      "author": {
        "name": "sychen",
        "email": "sychen@ctrip.com",
        "time": "Thu Mar 19 11:12:12 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Thu Mar 19 11:12:12 2026 +0800"
      },
      "message": "[CELEBORN-2276] Fix race condition in MemoryManager.releaseSortMemory\n\n### What changes were proposed in this pull request?\nUse `updateAndGet`\n\n### Why are the changes needed?\n`reserveSortMemory` calls `sortMemoryCounter.addAndGet` without any synchronized, so the lock in `releaseSortMemory` doesn\u0027t actually protect against races with `reserveSortMemory` anyway.\n\n### Does this PR resolve a correctness bug?\nNo\n\n### Does this PR introduce _any_ user-facing change?\nNo\n\n### How was this patch tested?\nGHA\n\nCloses #3620 from cxzl25/CELEBORN-2276.\n\nAuthored-by: sychen \u003csychen@ctrip.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "5fc0f199a4080c407bcb2bd5225b6e1edafaadf8",
      "tree": "79b2c6b2efc645803668eaca45a08812b2f97b0b",
      "parents": [
        "b4cb5a0b1ac097d33baf8dded1b5be2afd0578a4"
      ],
      "author": {
        "name": "yew1eb",
        "email": "yew1eb@gmail.com",
        "time": "Wed Mar 18 22:22:55 2026 +0800"
      },
      "committer": {
        "name": "Shuang",
        "email": "lvshuang.xjs@alibaba-inc.com",
        "time": "Wed Mar 18 22:22:55 2026 +0800"
      },
      "message": "[CELEBORN-2282] Eliminate redundant HashMap lookups in CelebornInputStream#fillBuffer\n\n### What changes were proposed in this pull request?\n\nReplace three redundant HashMap operations (`containsKey` + `put` + `get`) with a single `computeIfAbsent` call in `CelebornInputStream#fillBuffer`.\n\n### Why are the changes needed?\n\n`fillBuffer` is on the hot shuffle-read path and called for every batch read. The original code performs up to three HashMap lookups per batch instead of one, causing unnecessary CPU overhead at scale.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nExisting unit tests.\n\nCloses #3627 from yew1eb/CELEBORN-2282.\n\nAuthored-by: yew1eb \u003cyew1eb@gmail.com\u003e\nSigned-off-by: Shuang \u003clvshuang.xjs@alibaba-inc.com\u003e\n"
    },
    {
      "commit": "b4cb5a0b1ac097d33baf8dded1b5be2afd0578a4",
      "tree": "d24110ceb4550b5f3dbb920f48673316e9f11f79",
      "parents": [
        "7b25797fb8cf08c3495419b1d872663b08f829fb"
      ],
      "author": {
        "name": "yew1eb",
        "email": "yew1eb@gmail.com",
        "time": "Wed Mar 18 22:20:47 2026 +0800"
      },
      "committer": {
        "name": "Shuang",
        "email": "lvshuang.xjs@alibaba-inc.com",
        "time": "Wed Mar 18 22:20:47 2026 +0800"
      },
      "message": "[CELEBORN-2283][BUG] Fix missing return in Master.handleRequestSlots when all workers are excluded\n\n### What changes were proposed in this pull request?\n  Add a missing `return` statement after `context.reply()` in `Master#handleRequestSlots`\n  when `numAvailableWorkers \u003d\u003d 0`.\n\n### Why are the changes needed?\n  When all workers are excluded, the code replies with `WORKER_EXCLUDED` but continues\n  executing to `Random.nextInt(numAvailableWorkers)` (i.e. `Random.nextInt(0)`), which\n  throws `IllegalArgumentException`. This results in a duplicate response being sent to\n  the client and misleading error logs on the Master side.\n\n### Does this PR resolve a correctness bug?\n No.\n\n### Does this PR introduce _any_ user-facing change?\n No.\n\n### How was this patch tested?\nExisting unit tests.\n\nCloses #3628 from yew1eb/CELEBORN-2283.\n\nAuthored-by: yew1eb \u003cyew1eb@gmail.com\u003e\nSigned-off-by: Shuang \u003clvshuang.xjs@alibaba-inc.com\u003e\n"
    },
    {
      "commit": "7b25797fb8cf08c3495419b1d872663b08f829fb",
      "tree": "f698d78b036c5534c79d58601d729d6c99d3440a",
      "parents": [
        "15cba469467b6038c5c40ee50c6420726a7c1a2e"
      ],
      "author": {
        "name": "Shuai Lu",
        "email": "lushuainada@gmail.com",
        "time": "Wed Mar 18 22:18:21 2026 +0800"
      },
      "committer": {
        "name": "Shuang",
        "email": "lvshuang.xjs@alibaba-inc.com",
        "time": "Wed Mar 18 22:18:21 2026 +0800"
      },
      "message": "[CELEBORN-2274] Fix replicate channels not resumed when transitioning from PUSH_AND_REPLICATE_PAUSED to PUSH_PAUSED\n\n### What changes were proposed in this pull request?\n\nFix a bug in `MemoryManager.switchServingState()` where replicate channels permanently lose `autoRead\u003dtrue` after a memory pressure event.\n\nWhen the serving state transitions from `PUSH_AND_REPLICATE_PAUSED` to `PUSH_PAUSED`, `resumeReplicate()` was only called inside the `!tryResumeByPinnedMemory()` guard. If `tryResumeByPinnedMemory()` returned `true`, the entire block was skipped and replicate channels were never resumed.\n\nThe fix moves `resumeReplicate()` outside the `tryResumeByPinnedMemory()` guard so it is always called when stepping down from `PUSH_AND_REPLICATE_PAUSED` to `PUSH_PAUSED`. This is a state machine invariant: `PUSH_PAUSED` means only push is paused; replicate must always be resumed.\n\n### Why are the changes needed?\n\nOnce replicate channels are stuck with `autoRead\u003dfalse`, Netty I/O threads stop reading from all replicate connections. Remote workers writing to the affected worker see their TCP send buffers fill up (zero window), causing pending writes to accumulate in `ChannelOutboundBuffer`. Each pending write holds a reference to a direct memory `ByteBuf`, causing direct memory to grow indefinitely on the remote workers.\n\nThe failure sequence:\n1. Worker hits memory pressure → state \u003d `PUSH_AND_REPLICATE_PAUSED` → all channels paused\n2. Pinned memory is low → `tryResumeByPinnedMemory()` returns `true` → `resumeByPinnedMemory(PUSH_PAUSED)` resumes push only, replicate not resumed\n3. Memory drops to push-only range → state \u003d `PUSH_PAUSED`, but `resumeReplicate()` is never called\n4. 
Replicate channels permanently stuck with `autoRead\u003dfalse`, causing unbounded direct memory growth on remote workers\n\n### Does this PR resolve a correctness bug?\n\nYes.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nAdded a new unit test `Test MemoryManager resume replicate by pinned memory` in `MemoryManagerSuite` that reproduces the exact failure scenario:\n1. Enter `PUSH_AND_REPLICATE_PAUSED` with low pinned memory (channels resumed by pinned memory path)\n2. Raise pinned memory so both push and replicate get paused\n3. Drop memory to `PUSH_PAUSED` range with low pinned memory\n4. Assert replicate listener is resumed — this assertion fails without the fix\n\nCloses #3616 from sl3635/CELEBORN-2274.\n\nAuthored-by: Shuai Lu \u003clushuainada@gmail.com\u003e\nSigned-off-by: Shuang \u003clvshuang.xjs@alibaba-inc.com\u003e\n"
    },
    {
      "commit": "15cba469467b6038c5c40ee50c6420726a7c1a2e",
      "tree": "f60554c3990fbe0d826ae6a149306847fa6d3e50",
      "parents": [
        "af0ba1a5ec0e1faf3d4a0d189058c755aeb6b18c"
      ],
      "author": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed Mar 18 20:41:01 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed Mar 18 20:41:01 2026 +0800"
      },
      "message": "[MINOR] Update repository references from incubator-gluten to gluten for TLP graduation of gluten\n\n### What changes were proposed in this pull request?\n\nUpdate repository references from incubator-gluten to gluten for TLP graduation of gluten.\n\n- Backport https://github.com/apache/gluten/pull/11735.\n- Close #3631.\n\n### Why are the changes needed?\n\nApache Gluten has already TLP graduated, which should update repository references.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nNo.\n\nCloses #3633 from SteNicholas/gluten-graduation.\n\nAuthored-by: SteNicholas \u003cprogramgeek@163.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "af0ba1a5ec0e1faf3d4a0d189058c755aeb6b18c",
      "tree": "c991f7586da7fe2cded1fb9dd0b498deffd69afd",
      "parents": [
        "400e9518d7f19cacf8797971e169afa6bdea5be6"
      ],
      "author": {
        "name": "sychen",
        "email": "sychen@ctrip.com",
        "time": "Wed Mar 18 15:55:32 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed Mar 18 15:55:32 2026 +0800"
      },
      "message": "[CELEBORN-2281] Improve error logging and null checks in CreditStreamManager\n\n### What changes were proposed in this pull request?\n\n- Initialize `AtomicReference\u003cIOException\u003e` with proper syntax.\n- Add exception to `logger.error` for better error context.\n- Simplify and improve null checks and logging in `addCredit` and `cleanResource` methods.\n\n### Why are the changes needed?\n\nnit.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nGHA.\n\nCloses #3626 from cxzl25/CELEBORN-2281.\n\nLead-authored-by: sychen \u003csychen@ctrip.com\u003e\nCo-authored-by: cxzl25 \u003c3898450+cxzl25@users.noreply.github.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "400e9518d7f19cacf8797971e169afa6bdea5be6",
      "tree": "4cf6016fecd2cbcc7f539a8b3a796b71234acc8f",
      "parents": [
        "8c2c9523d07a38a6c750834e8d9598d1e8571157"
      ],
      "author": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Sat Mar 14 17:27:00 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Sat Mar 14 17:27:00 2026 +0800"
      },
      "message": "[CELEBORN-2280] Support celeborn.network.memory.allocator.type to specify netty memory allocator\n\n### What changes were proposed in this pull request?\n\nSupport `celeborn.network.memory.allocator.type` to specify netty memory allocator including `AdaptiveByteBufAllocator `.\n\n### Why are the changes needed?\n\nNetty 4.2 introduces `AdaptiveByteBufAllocator` an auto-tuning pooling `ByteBufAllocator` which uses `AdaptivePoolingAllocator` added in https://github.com/netty/netty/pull/13075.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nIntroduce `celeborn.network.memory.allocator.type` to specify netty memory allocator.\n\n### How was this patch tested?\n\nCI.\n\nCloses #3625 from SteNicholas/CELEBORN-2280.\n\nAuthored-by: SteNicholas \u003cprogramgeek@163.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "8c2c9523d07a38a6c750834e8d9598d1e8571157",
      "tree": "1662f58628e70605270ca37536b07a14d9e53023",
      "parents": [
        "4b157c68a5f7e82f4a50b8a2c8ae8989aff843c1"
      ],
      "author": {
        "name": "sychen",
        "email": "sychen@ctrip.com",
        "time": "Tue Mar 10 17:22:51 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed Mar 11 19:43:16 2026 +0800"
      },
      "message": "[CELEBORN-2279] Update log level from `INFO` to `ERROR` for console output in spark-it tests\n\n### What changes were proposed in this pull request?\n\nUpdate log level from `INFO` to `ERROR` for console output in spark-it tests.\n\n### Why are the changes needed?\n\n`spark-it` outputs too many INFO level logs to stdout.\n\n\u003cimg width\u003d\"1017\" height\u003d\"143\" alt\u003d\"image\" src\u003d\"https://github.com/user-attachments/assets/41667a37-050b-4174-afe6-6e4afcda8fcc\" /\u003e\n\n\u003cimg width\u003d\"783\" height\u003d\"188\" alt\u003d\"image\" src\u003d\"https://github.com/user-attachments/assets/6bb4f3e3-223d-4f8c-bea6-4c8d451bf4b9\" /\u003e\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nGHA.\n\nCloses #3623 from cxzl25/CELEBORN-2279.\n\nAuthored-by: sychen \u003csychen@ctrip.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "4b157c68a5f7e82f4a50b8a2c8ae8989aff843c1",
      "tree": "630c63b237000990298bee7089d64a1a7e789784",
      "parents": [
        "b78177f3ac7adceb1f0510d2111943702e726eba"
      ],
      "author": {
        "name": "Aravind Patnam",
        "email": "akpatnam25@gmail.com",
        "time": "Tue Mar 10 10:36:54 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Tue Mar 10 10:36:54 2026 +0800"
      },
      "message": "[CELEBORN-2278] Make HTTP auth bypass API paths configurable\n\n### What changes were proposed in this pull request?\n\nAllow http paths that should be bypassed from auth to be configured. This is particularly useful when one of the read endpoints is used for health checks, and should not require auth each time for a high frequency operation.\n\n### Why are the changes needed?\n\nSee above.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nTrivial change, already added in our cluster for certain endpoints.\n\nCloses #3622 from akpatnam25/CELEBORN-2278.\n\nAuthored-by: Aravind Patnam \u003cakpatnam25@gmail.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "b78177f3ac7adceb1f0510d2111943702e726eba",
      "tree": "37c12a2779ffe43aca621e06fcc2d74b01350d08",
      "parents": [
        "391ef4bfc42b4c121d6a029d60689af15ab16b5b"
      ],
      "author": {
        "name": "Shuai Lu",
        "email": "lushuainada@gmail.com",
        "time": "Mon Mar 09 10:34:10 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon Mar 09 10:34:10 2026 +0800"
      },
      "message": "[CELEBORN-2238] Fix RuntimeException during stream cleanup preventing peer failover\n\n### What changes were proposed in this pull request?\n\nFix a bug in `CelebornInputStream` where a `RuntimeException` thrown during best-effort stream cleanup prevents peer failover when a primary worker becomes unregistered.\n\nIn `createReaderWithRetry`, when reader creation fails on the primary, the code tries to close the existing stream by calling `clientFactory.createClient()` before switching to the peer. This cleanup was wrapped in `catch (InterruptedException | IOException ex)`. When SASL authentication is configured, `SaslClientBootstrap` wraps `IOException` in `RuntimeException`, so the cleanup call also throws `RuntimeException`. This uncaught exception escapes the retry loop entirely, bypassing `location \u003d location.getPeer()` and causing the executor to exhaust retries on the same failed primary worker.\n\nThe fix adds `RuntimeException` to the cleanup catch clause — `catch (InterruptedException | IOException | RuntimeException ex)` — so that any exception during best-effort cleanup is logged and swallowed, allowing the peer switch to proceed.\n\n### Why are the changes needed?\n\nWithout this fix, when a worker pod is rotated or becomes unregistered and SASL authentication is enabled, the replica retry mechanism silently fails. 
The executor retries multiple times on the same dead primary worker and eventually fails the task, even though a healthy replica exists.\n\n### Does this PR resolve a correctness bug?\n\nYes.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nAdded `CelebornInputStreamPeerFailoverTest` with three unit tests:\n- `testPeerFailoverWithRuntimeExceptionDuringCleanup`: primary fails, cleanup throws `RuntimeException` (simulates SASL wrapping), replica succeeds — verifies the fix\n- `testPeerFailoverWithIOExceptionDuringCleanup`: same scenario with plain `IOException` during cleanup — verifies existing behavior is preserved\n- `testFailureWithoutPeer`: no replica configured, verifies retries are exhausted and `CelebornIOException` is thrown\n\nCloses #3617 from sl3635/CELEBORN-2238.\n\nAuthored-by: Shuai Lu \u003clushuainada@gmail.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "391ef4bfc42b4c121d6a029d60689af15ab16b5b",
      "tree": "da484e8735734efdb0ab4ba99e469141ff7d9c17",
      "parents": [
        "dca37496ce59bd67526548957d2f607af8eee6cc"
      ],
      "author": {
        "name": "ShlomiTubul",
        "email": "shlomi.tubul@placer.ai",
        "time": "Thu Mar 05 14:32:49 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Thu Mar 05 14:32:49 2026 +0800"
      },
      "message": "[CELEBORN-2273] Fix cache mutation in TagsManager.getTaggedWorkers()\n\nWhat changes were proposed in this pull request?\ngetTaggedWorkers() obtains a direct reference to the cached Set from getWorkersWithTag()and then calls retainAll() on it to intersect with other tags and available workers. Since retainAll() mutates the Set in-place, this permanently corrupts the cached entry. When multiple applications with different tag combinations share the same master, one app\u0027s intersection shrinks the cached Set, causing subsequent lookups by other apps to find fewer or zero workers. Once corrupted to an empty Set, all future slot requests fail with WORKER_EXCLUDED until the cache is refreshed.\n\nWhy are the changes needed?\nDoes this PR resolve a correctness bug?\nYes\n\nDoes this PR introduce any user-facing change?\nNo\n\nHow was this patch tested?\ncustom image in my dev env + local test\n\nCloses #3615 from shlomitubul/main.\n\nAuthored-by: ShlomiTubul \u003cshlomi.tubul@placer.ai\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "dca37496ce59bd67526548957d2f607af8eee6cc",
      "tree": "a05960521f264643b17529754b19b185f20b7d00",
      "parents": [
        "13ea40c3d086a483f9913628b80640591c223508"
      ],
      "author": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Tue Mar 03 11:24:45 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Tue Mar 03 11:24:45 2026 +0800"
      },
      "message": "[CELEBORN-2218] Bump lz4-java version from 1.8.0 to 1.10.4 to resolve CVE‐2025‐12183 and CVE-2025-66566\n\n### What changes were proposed in this pull request?\n\n- Bump lz4-java version from 1.8.0 to 1.10.4 to resolve CVE‐2025‐12183 and CVE-2025-66566.\n- `Lz4Decompressor` follows the [suggestion](https://github.com/apache/spark/pull/53290#issuecomment-3607045004) to move from `fastDecompressor` to `safeDecompressor` to mitigate the performance.\n\nBackport:\n\n- https://github.com/apache/spark/pull/53327\n- https://github.com/apache/spark/pull/53347\n- https://github.com/apache/spark/pull/53971\n- https://github.com/apache/spark/pull/53454\n- https://github.com/apache/spark/pull/54585\n\n### Why are the changes needed?\n\n- [CVE‐2025‐12183](https://sites.google.com/sonatype.com/vulnerabilities/cve-2025-12183): Various lz4-java compression and decompression implementations do not guard against out-of-bounds memory access. Untrusted input may lead to denial of service and information disclosure. Vulnerable Maven coordinates: org.lz4:lz4-java up to and including 1.8.0.\n\n- [CVE-2025-66566](https://github.com/advisories/GHSA-cmp6-m4wj-q63q): Insufficient clearing of the output buffer in Java-based decompressor implementations in lz4-java 1.10.0 and earlier allows remote attackers to read previous buffer contents via crafted compressed input. In applications where the output buffer is reused without being cleared, this may lead to disclosure of sensitive data. JNI-based implementations are not affected.\n\nTherefore, lz4-java version should upgrade to 1.10.4.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nCI.\n\nCloses #3555 from SteNicholas/CELEBORN-2218.\n\nLead-authored-by: SteNicholas \u003cprogramgeek@163.com\u003e\nCo-authored-by: Cheng Pan \u003cchengpan@apache.org\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "13ea40c3d086a483f9913628b80640591c223508",
      "tree": "e1ddbcf4de0ce6d76ec258a77b19989f323b3057",
      "parents": [
        "ffdb41674596970c9a3ad7a85a08f81b37ae622d"
      ],
      "author": {
        "name": "Cheng Pan",
        "email": "chengpan@apache.org",
        "time": "Mon Mar 02 20:01:28 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon Mar 02 20:01:28 2026 +0800"
      },
      "message": "[CELEBORN-2272] Add LZ4TPCDSDataBenchmark\n\n### What changes were proposed in this pull request?\n\nAdd LZ4TPCDSDataBenchmark, use TPC-DS data to measure compression/decompression perf.\n\n### Why are the changes needed?\n\nProvide benchmark reports to measure performance change when upgrading lz4-java.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nSee benchmark reports.\n\nCloses #3613 from pan3793/lz4-benchmark.\n\nAuthored-by: Cheng Pan \u003cchengpan@apache.org\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "ffdb41674596970c9a3ad7a85a08f81b37ae622d",
      "tree": "7c0f5fc1daddaa34ba73b853071435ff9fa75519",
      "parents": [
        "0a67e2b304f464bd6199982582f06324a00beca4"
      ],
      "author": {
        "name": "Enrico Olivelli",
        "email": "eolivelli@gmail.com",
        "time": "Mon Mar 02 18:07:39 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon Mar 02 18:07:39 2026 +0800"
      },
      "message": "[CELEBORN-2270] Fix problem with eviction to tiered storage during partition split\n\nNOTE: this PR is stacked on top of https://github.com/apache/celeborn/pull/3608\n\nPlease consider only 756d25e49ef5f0321b90002d319b72924b9f4196\n\n### What changes were proposed in this pull request?\n\nHandle the eviction to a different location type.\n\n### Why are the changes needed?\n\nBecause it may happen that a MEMORY file is to be evicted to another storage type (i.e. S3). This does not work.\n\nUsually, as described in tests in #3608 when you have tiered storage, the primary partition type is usually not MEMORY, but it may happen that during a partition split the client decides to use MEMORY.\n\nThis patch fixes the problem on the worker side.\nAn alternative fix would be to change the behavior of the client, and simulate what the master does when offering slots.\nSuch a change would be more involved and it won\u0027t make the server side resilient to this scenario.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\n- New integration tests\n- Manual testing on real k8s cluster with S3\n\nCloses #3610 from eolivelli/CELEBORN-2270-fix-partition-split.\n\nAuthored-by: Enrico Olivelli \u003ceolivelli@gmail.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "0a67e2b304f464bd6199982582f06324a00beca4",
      "tree": "8511d5ed7e616b454e4c1795b7070fc0b271a06b",
      "parents": [
        "fddb81754b03326f15df0b84ea1568a0621b7b88"
      ],
      "author": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon Mar 02 18:04:40 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon Mar 02 18:04:40 2026 +0800"
      },
      "message": "[CELEBORN-2049] Bump Ratis version from 3.1.3 to 3.2.1\n\n### What changes were proposed in this pull request?\n\nBump Ratis version from 3.1.3 to 3.2.1 including adding options of `celeborn-ratis sh peer add -peers` in `celeborn_ratis_shell.md` to follow up https://github.com/apache/ratis/pull/1282.\n\n### Why are the changes needed?\n\nBump Ratis version from 3.1.3 to 3.2.1. Ratis has released v3.2.1, of which release note refers to [3.2.1](https://ratis.apache.org/post/3.2.1.html). The 3.2.1 version is a maintenance release with multiple improvements and bugfixes. The usage of `celeborn-ratis` is as follows:\n\n```\n$ celeborn-ratis sh\nUsage: ratis sh [generic options]\n\t [election [transfer] [stepDown] [pause] [resume]]\n\t [group [info] [list]]\n\t [local [raftMetaConf]]\n\t [peer [add] [remove] [setPriority]]\n\t [snapshot [create]]\n\n$ celeborn-ratis sh election transfer\nUsage: transfer -address \u003cHOSTNAME:PORT\u003e -peers \u003cPEER0_HOST:PEER0_PORT,PEER1_HOST:PEER1_PORT,PEER2_HOST:PEER2_PORT\u003e [-groupid \u003cRAFT_GROUP_ID\u003e] [-timeout \u003cTIMEOUT_IN_SECONDS\u003e]\n\n$ celeborn-ratis sh election stepDown\nUsage: stepDown -peers \u003cPEER0_HOST:PEER0_PORT,PEER1_HOST:PEER1_PORT,PEER2_HOST:PEER2_PORT\u003e [-groupid \u003cRAFT_GROUP_ID\u003e]\n\n$ celeborn-ratis sh election pause\nUsage: pause -address \u003cHOSTNAME:PORT\u003e -peers \u003cPEER0_HOST:PEER0_PORT,PEER1_HOST:PEER1_PORT,PEER2_HOST:PEER2_PORT\u003e [-groupid \u003cRAFT_GROUP_ID\u003e]\n\n$ celeborn-ratis sh election resume\nUsage: resume -address \u003cHOSTNAME:PORT\u003e -peers \u003cPEER0_HOST:PEER0_PORT,PEER1_HOST:PEER1_PORT,PEER2_HOST:PEER2_PORT\u003e [-groupid \u003cRAFT_GROUP_ID\u003e]\n\n$ celeborn-ratis sh group info\nUsage: info -peers \u003cPEER0_HOST:PEER0_PORT,PEER1_HOST:PEER1_PORT,PEER2_HOST:PEER2_PORT\u003e [-groupid \u003cRAFT_GROUP_ID\u003e]\n\n$ celeborn-ratis sh group list\nUsage: list -peers 
\u003cPEER0_HOST:PEER0_PORT,PEER1_HOST:PEER1_PORT,PEER2_HOST:PEER2_PORT\u003e [-groupid \u003cRAFT_GROUP_ID\u003e] \u003c[-serverAddress \u003cPEER0_HOST:PEER0_PORT\u003e]|[-peerId \u003cpeerId\u003e]\u003e\n\n$ celeborn-ratis sh peer add -peers\nUsage: add -peers \u003cPEER0_HOST:PEER0_PORT,PEER1_HOST:PEER1_PORT,PEER2_HOST:PEER2_PORT\u003e [-groupid \u003cRAFT_GROUP_ID\u003e] \u003c[-address \u003cPEER0_HOST:PEER0_PORT\u003e]|[-peerId \u003cpeerId\u003e]\u003e [-clientAddress \u003cCLIENT_ADDRESS1,CLIENT_ADDRESS2,...\u003e] [-adminAddress \u003cADMIN_ADDRESS1,ADMIN_ADDRESS2,...\u003e]\n\n$ celeborn-ratis sh peer remove -peers\nUsage: remove -peers \u003cPEER0_HOST:PEER0_PORT,PEER1_HOST:PEER1_PORT,PEER2_HOST:PEER2_PORT\u003e [-groupid \u003cRAFT_GROUP_ID\u003e] \u003c[-address \u003cPEER0_HOST:PEER0_PORT\u003e]|[-peerId \u003cpeerId\u003e]\u003e\n\n$ celeborn-ratis sh peer setPriority\nUsage: setPriority -peers \u003cPEER0_HOST:PEER0_PORT,PEER1_HOST:PEER1_PORT,PEER2_HOST:PEER2_PORT\u003e [-groupid \u003cRAFT_GROUP_ID\u003e] -addressPriority \u003cPEER_HOST:PEER_PORT|PRIORITY\u003e\n\n$ celeborn-ratis sh snapshot create\nUsage: create -peers \u003cPEER0_HOST:PEER0_PORT,PEER1_HOST:PEER1_PORT,PEER2_HOST:PEER2_PORT\u003e [-groupid \u003cRAFT_GROUP_ID\u003e] [-snapshotTimeout \u003ctimeoutInMs\u003e] [-peerId \u003craftPeerId\u003e]\n\n$ celeborn-ratis sh local raftMetaConf\nUsage: raftMetaConf -peers \u003c[P0_ID|]P0_HOST:P0_PORT,[P1_ID|]P1_HOST:P1_PORT,[P2_ID|]P2_HOST:P2_PORT\u003e -path \u003cPARENT_PATH_OF_RAFT_META_CONF\u003e\n```\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nCI.\n\nCloses #3612 from SteNicholas/CELEBORN-2049.\n\nAuthored-by: SteNicholas \u003cprogramgeek@163.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "fddb81754b03326f15df0b84ea1568a0621b7b88",
      "tree": "0c6bd54b3d26fda2922716b97c122c51fb147501",
      "parents": [
        "deb7538f23c739292030748abd99001f4aede225"
      ],
      "author": {
        "name": "afterincomparableyum",
        "email": "224495379+afterincomparableyum@users.noreply.github.com",
        "time": "Sun Mar 01 09:54:10 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Sun Mar 01 09:54:10 2026 +0800"
      },
      "message": "[CELEBORN-2226][CIP-14] Support RetryFetchChunk functionality for Cel…\n\nImplement chunk-fetch retry logic in CelebornInputStream::getNextChunk(), matching the Java CelebornInputStream behavior. When a chunk fetch fails, the retry loop excludes the failed worker, switches to the peer replica (if available), and sleeps between retry rounds before creating a new reader.\n\nAdded getLocation() to PartitionReader interface and WorkerPartitionReader\n\nReplaced the stub getNextChunk() with full retry logic: excluded worker checks, peer switching, configurable retry count, sleep between retries\n\nUpdated moveToNextChunk() and moveToNextReader() to handle nullable returns from getNextChunk()\n\nAdded unit test for WorkerPartitionReader::getLocation()\n\nAdded unit tests for getNextChunk() retry logic\n\nCI and build passes\n\nCloses #3605 from afterincomparableyum/cpp-client/celeborn-2226.\n\nAuthored-by: afterincomparableyum \u003c224495379+afterincomparableyum@users.noreply.github.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "deb7538f23c739292030748abd99001f4aede225",
      "tree": "1417c6a61e30c9e371743bb11370b25e8e77a300",
      "parents": [
        "bc4dc12cae0fee63b5951658f2c89dc5725eec22"
      ],
      "author": {
        "name": "zhenghuan",
        "email": "zhenghuan@weidian.com",
        "time": "Sat Feb 28 13:16:58 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Sat Feb 28 13:16:58 2026 +0800"
      },
      "message": "[CELEBORN-2248] Implement lazy loading for columnar shuffle classes and skew shuffle method using static holder pattern\n\n### What changes were proposed in this pull request?\n\nThis PR converts the static initialization of columnar shuffle class constructors\nand skew shuffle method to lazy initialization using the initialization-on-demand\nholder idiom (static inner class pattern) in SparkUtils.java.\n\nSpecifically, the following changes were made:\n\n1. Introduced `ColumnarHashBasedShuffleWriterConstructorHolder` static inner class\n   to lazily initialize the constructor for ColumnarHashBasedShuffleWriter\n\n2. Introduced `ColumnarShuffleReaderConstructorHolder` static inner class to lazily\n   initialize the constructor for CelebornColumnarShuffleReader\n\n3. Introduced `CelebornSkewShuffleMethodHolder` static inner class to lazily\n   initialize the `isCelebornSkewedShuffle` method reference\n\n4. Modified `createColumnarHashBasedShuffleWriter()`, `createColumnarShuffleReader()`,\n   and `isCelebornSkewShuffleOrChildShuffle()` methods to use the holder pattern for\n   lazy initialization\n\n5. 
Added JavaDoc comments explaining the lazy loading mechanism\n\n### Why are the changes needed?\n\nThe current implementation statically initializes columnar shuffle class constructors\nand the skew shuffle method at SparkUtils class loading time, which means these\nclasses/methods are loaded regardless of whether they are actually used.\n\nThis lazy loading approach ensures that:\n- Columnar shuffle classes are only loaded when actually needed (when\n  `celeborn.columnarShuffle.enabled` is true and the create methods are called)\n- CelebornShuffleState class is only loaded when skew shuffle detection is needed\n- Reduces unnecessary class loading overhead for users not using these features\n- Improves startup performance and memory footprint\n- Aligns with the conditional usage pattern already present in SparkShuffleManager\n\nThe static holder pattern (initialization-on-demand holder idiom) provides several\nadvantages:\n- Thread-safe without explicit synchronization (guaranteed by JVM class loading mechanism)\n- No synchronization overhead at runtime (no volatile reads or lock acquisition)\n- Simpler and more concise code compared to double-checked locking\n- Recommended by Effective Java (Item 83) for lazy initialization\n\n### Does this PR resolve a correctness bug?\n\nNo, this is a performance optimization.\n\n### Does this PR introduce any user-facing change?\n\nNo. This change only affects when certain classes are loaded internally.\nThe functionality and API remain unchanged.\n\n### How was this patch tested?\n\n- Code review to verify correct implementation of the initialization-on-demand holder pattern\n- Verified that JVM class loading guarantees thread safety\n- The changes are backward compatible and don\u0027t alter functionality, only initialization timing\n\nCloses #3581 from ever4Kenny/CELEBORN-2248.\n\nAuthored-by: zhenghuan \u003czhenghuan@weidian.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "bc4dc12cae0fee63b5951658f2c89dc5725eec22",
      "tree": "057c9900527589055798fb46be9bdd37bd31b38f",
      "parents": [
        "e119902b6c538789679a277056ea58f91e19f455"
      ],
      "author": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Fri Feb 27 19:59:26 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Fri Feb 27 19:59:26 2026 +0800"
      },
      "message": "[CELEBORN-2239] Support Spark 4.1\n\n### What changes were proposed in this pull request?\n\nSupport Spark 4.1.\n\n### Why are the changes needed?\n\nSpark 4.1.1 has already released, which release notes refer to [Spark Release 4.1.1](https://spark.apache.org/releases/spark-release-4.1.1.html).\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nCI.\n\nCloses #3571 from SteNicholas/CELEBORN-2239.\n\nAuthored-by: SteNicholas \u003cprogramgeek@163.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "e119902b6c538789679a277056ea58f91e19f455",
      "tree": "0d1c4b21bea2d3b44594f29f1e5d4a9ef0ff1dde",
      "parents": [
        "9d48e6cc5f2da1a7612972b06b61f448fac9edab"
      ],
      "author": {
        "name": "Enrico Olivelli",
        "email": "eolivelli@gmail.com",
        "time": "Fri Feb 27 15:19:22 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Fri Feb 27 15:19:22 2026 +0800"
      },
      "message": "[CELEBORN-2268] Improve test coverage for MEMORY and S3 storage\n\n### What changes were proposed in this pull request?\n\nThis commit adds only tests and some useful debug information about using MEMORY and S3 storage.\n\n### Why are the changes needed?\n\nBecause there is not enough code coverage on some configurations that may happen in production,  in particular about:\n- using MEMORY storage\n- using only S3 storage\n- using MEMORY with eviction to S3\n\nThere is an interesting case to test: when you configure MEMORY to S3 eviction and the dataset is small.\nIt is important to ensure that no file is created in S3\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nIt adds new integration tests.\n\nCloses #3608 from eolivelli/fix-eviction-apache.\n\nAuthored-by: Enrico Olivelli \u003ceolivelli@gmail.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "9d48e6cc5f2da1a7612972b06b61f448fac9edab",
      "tree": "dbf6aa8556846a52dfa746ddb85d39c8635f9363",
      "parents": [
        "ad5a88161cdd6febfb499f00dfbedd8ecb9d9d2a"
      ],
      "author": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Fri Feb 27 11:39:59 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Fri Feb 27 11:39:59 2026 +0800"
      },
      "message": "[CELEBORN-2258] Bump Netty version from 4.1.118.Final to 4.2.10.Final\n\n### What changes were proposed in this pull request?\n\nBump Netty version from 4.1.118.Final to 4.2.10.Final, which follows the official community migration guide: [Netty-4.2-Migration-Guide](https://github.com/netty/netty/wiki/Netty-4.2-Migration-Guide).\n\n### Why are the changes needed?\n\nThe Netty 4.2.10.Final version has been released, which netty version is 4.1.118.Final at present.\n\nBackport:\n\n- https://github.com/apache/spark/pull/34881\n- https://github.com/apache/spark/pull/52552\n- https://github.com/apache/spark/pull/53499\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nCI.\n\nCloses #3596 from SteNicholas/CELEBORN-2258.\n\nAuthored-by: SteNicholas \u003cprogramgeek@163.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "ad5a88161cdd6febfb499f00dfbedd8ecb9d9d2a",
      "tree": "322694754278624d7320eb6875b8d943df972c2b",
      "parents": [
        "2460efd95e59dfa58a53f334fe2d25c3073cc013"
      ],
      "author": {
        "name": "afterincomparableyum",
        "email": "224495379+afterincomparableyum@users.noreply.github.com",
        "time": "Thu Feb 26 11:24:37 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Thu Feb 26 11:24:37 2026 +0800"
      },
      "message": "[CELEBORN-2267][FOLLOWUP] Add Cpp-Write Java-Read integration tests for LZ4 and ZSTD\n\nThis is a follow up to https://github.com/apache/celeborn/pull/3575\n\n- Add compression codec argument to C++ DataSumWithWriterClient and set it in CelebornConf so the writer uses LZ4/ZSTD when enabled\n- Pass codec from runCppWriteJavaRead to the C++ writer command\n- Add CppWriteJavaReadTestWithLZ4 and CppWriteJavaReadTestWithZSTD (mirroring CppWriteJavaReadTestWithNONE)\n\n### How was this patch tested?\n\nI compiled and ran tests locally, all passed.\n\nCloses #3606 from afterincomparableyum/cpp-client/celeborn-2267.\n\nAuthored-by: afterincomparableyum \u003c224495379+afterincomparableyum@users.noreply.github.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "2460efd95e59dfa58a53f334fe2d25c3073cc013",
      "tree": "f2582dba611ab6212f4a1c78e4824bde03557c2a",
      "parents": [
        "4d97a8560a3bf5a839c14282a111eaf54bdac35f"
      ],
      "author": {
        "name": "afterincomparableyum",
        "email": "224495379+afterincomparableyum@users.noreply.github.com",
        "time": "Thu Feb 26 11:18:48 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Thu Feb 26 11:18:48 2026 +0800"
      },
      "message": "[CELEBORN-2222][CIP-14] Support Retrying when createReader failed for CelebornInputStream in CppClient\n\nThis PR implements retry support for createReader failures in the C++ client, matching the behavior of the Java implementation. The implementation includes:\n\n- Added configuration properties:\n  * clientFetchMaxRetriesForEachReplica (default: 3)\n  * dataIoRetryWait (default: 5s)\n  * clientPushReplicateEnabled (default: false)\n  * excludeWorkerOnFailure (default: false)\n  * excludedWorker.expireTimeout (default: 60s)\n  * optimizeSkewedPartitionRead (default: false)\n\n- Added peer location support methods to PartitionLocation:\n  * hasPeer() - Check if location has a peer replica\n  * getPeer() - Get the peer location\n  * hostAndFetchPort() - Get host:port string for logging\n\n- Implemented retry logic in createReaderWithRetry():\n  * Retries up to fetchChunkMaxRetry_ times (doubled if replication enabled)[which is why I added this parameter in this PR]\n  * Switches to peer location on failure when available\n  * Sleeps between retries when both replicas tried or no peer exists\n  * Resets retry counter when moving to new location or on success\n\n- Added unit tests for new functionality\n\n### How was this patch tested?\n\nUnit tests and compiling\n\nCloses #3583 from afterincomparableyum/cpp-client/celeborn-2222.\n\nAuthored-by: afterincomparableyum \u003c224495379+afterincomparableyum@users.noreply.github.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "4d97a8560a3bf5a839c14282a111eaf54bdac35f",
      "tree": "fd334fda3cfc982c7f5c49a481beb51ccc3590a9",
      "parents": [
        "feb3ed90c36d924dbfab8e2bec1972c4ef162486"
      ],
      "author": {
        "name": "Prateek Srivastava",
        "email": "me@prateek.io",
        "time": "Thu Feb 26 11:15:53 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Thu Feb 26 11:15:53 2026 +0800"
      },
      "message": "[CELEBORN-2271] StorageManager#saveCommittedFileInfosExecutor should call shutdown before awaitTermination\n\n### What changes were proposed in this pull request?\nCall saveCommittedFileInfosExecutor.shutdown() before awaitTermination() in saveAllCommittedFileInfosToDB() so the executor shuts down correctly during worker shutdown.\n\n### Why are the changes needed?\nawaitTermination() only waits for the executor to finish after a shutdown has been requested; without shutdown(), the executor keeps running and can schedule more work.\n\nFrom [ExecutorService documentation](https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html#awaitTermination-long-java.util.concurrent.TimeUnit-):\n\n\u003e Blocks until all tasks have completed execution \"after a shutdown request\", ...\n\n### Does this PR resolve a correctness bug?\n\nPossibly, as it could lead to race conditions writing to RocksDB during shutdown, which could cause data loss or correctness issues.\n\n### Does this PR introduce _any_ user-facing change?\nNo.\n\n### How was this patch tested?\nShould be exercised by  existing tests to ensure this doesn\u0027t introduce a regression.\n\nCloses #3607 from f2prateek/fix-shutdown.\n\nAuthored-by: Prateek Srivastava \u003cme@prateek.io\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "feb3ed90c36d924dbfab8e2bec1972c4ef162486",
      "tree": "b164a6587eeb65bd015739768fb16c8ff55900eb",
      "parents": [
        "ac6d1cf5d399618f934bd82983ef9632d278b340"
      ],
      "author": {
        "name": "Enrico Olivelli",
        "email": "eolivelli@gmail.com",
        "time": "Wed Feb 25 09:46:54 2026 +0800"
      },
      "committer": {
        "name": "Shuang",
        "email": "lvshuang.xjs@alibaba-inc.com",
        "time": "Wed Feb 25 09:46:54 2026 +0800"
      },
      "message": "[CELEBORN-2262] Prepare S3 directory only once and cache s3 client for MultiPartUploader\n\n### What changes were proposed in this pull request?\n\n- Create only one S3 client for all MultiPartUploaders\n- Create S3 worker directory only once and not for every slot\n\n### Why are the changes needed?\n- Because on S3 AWS creating connections is slow (due to credentials handshaking and TLS handshaking)\n- Because \"mkdirs\" in S3 AWS is very slow (and it needs several S3 calls)\n\nSample CPU flamegraph about the need of Connection pooling:\n\u003cimg width\u003d\"2248\" height\u003d\"1275\" alt\u003d\"image\" src\u003d\"https://github.com/user-attachments/assets/5fb46d8f-5a1e-41a0-a8ca-01c92a2a3eb0\" /\u003e\n\nSample CPU flamegraph about the need of pooling the client due to AssumeRoleWithWebIdentity\n\u003cimg width\u003d\"2248\" height\u003d\"1275\" alt\u003d\"image\" src\u003d\"https://github.com/user-attachments/assets/e9efbadd-ef68-40d3-8fb5-d8fe43f56752\" /\u003e\n\n### Does this PR resolve a correctness bug?\n\nNo\n\n### Does this PR introduce _any_ user-facing change?\n\nNo\n\n### How was this patch tested?\n\nManual testing.\n\nThere is one end-to-end integration test with S3 that exercise this code\n\nCloses #3604 from eolivelli/improve-s3-apache.\n\nAuthored-by: Enrico Olivelli \u003ceolivelli@gmail.com\u003e\nSigned-off-by: Shuang \u003clvshuang.xjs@alibaba-inc.com\u003e\n"
    },
    {
      "commit": "ac6d1cf5d399618f934bd82983ef9632d278b340",
      "tree": "17379cc92c3ed2163f40098d9d65c16fe1fdf7a7",
      "parents": [
        "e4836b75527ace35837ee0c87b6585b1d8701b66"
      ],
      "author": {
        "name": "jay.narale",
        "email": "jay.narale@uber.com",
        "time": "Tue Feb 24 13:55:59 2026 +0800"
      },
      "committer": {
        "name": "Shuang",
        "email": "lvshuang.xjs@alibaba-inc.com",
        "time": "Tue Feb 24 13:55:59 2026 +0800"
      },
      "message": "[CELEBORN-2269] Update Cpp TransportClient to resolve hostnames via DNS\n\n### What changes were proposed in this pull request?\n\nThis PR changes the folly::SocketAddress constructor calls in TransportClientFactory::createClient to pass true for the allowNameLookup parameter. This affects two call sites: the address used as the connection pool key, and the address used when connecting the bootstrap to the server.\n\nFolly code - https://github.com/facebook/folly/blob/main/folly/SocketAddress.h#L80\n\n### Why are the changes needed?\n\nWithout allowNameLookup \u003d true, folly::SocketAddress only accepts numeric IP addresses. When a Celeborn worker is addressed by hostnamehe constructor throws an \"invalid address\" exception, causing all connections to that worker to fail.\n\nSetting the parameter to true makes folly::SocketAddress use getaddrinfo, which transparently handles both hostnames (via DNS resolution) and numeric IPs. This is safe and backward-compatible since getaddrinfo recognizes numeric addresses without issuing a DNS query.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nCI.\n\nCloses #3609 from jaystarshot/u-c.\n\nAuthored-by: jay.narale \u003cjay.narale@uber.com\u003e\nSigned-off-by: Shuang \u003clvshuang.xjs@alibaba-inc.com\u003e\n"
    },
    {
      "commit": "e4836b75527ace35837ee0c87b6585b1d8701b66",
      "tree": "7ad51baebe30e80abd02e114cb78cfbb2fff60d7",
      "parents": [
        "13cd4a9232b7f771e2314943d8614598f7d62283"
      ],
      "author": {
        "name": "jay.narale",
        "email": "jay.narale@uber.com",
        "time": "Thu Feb 19 16:23:11 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Thu Feb 19 16:23:11 2026 +0800"
      },
      "message": "[CELEBORN-2266] Modernize Protobuf CMake usage and add install rules\n\n### What changes were proposed in this pull request?\n\n- Switched Protobuf CMake integration from the legacy FindProtobuf module to modern CONFIG mode with imported targets (protobuf::protoc, protobuf::libprotobuf).\n\n- Added install() rules for public headers, generated proto headers, and static libraries so the C++ client can be consumed as an installed package.\n\n- Added missing #include \u003cset\u003e in CelebornUtils.h.\n\n### Why are the changes needed?\n\nThe legacy FindProtobuf module-variable style (${PROTOBUF_LIBRARY}, bare protoc command) is fragile and does not work reliably with package managers like vcpkg or Conan that provide Protobuf via CMake config files. Switching to CONFIG mode and imported targets is the modern CMake best practice and ensures the correct protoc binary and library are used regardless of the build environment.\n\nThe install rules are needed so that downstream projects can consume the Celeborn C++ client from a clean install prefix rather than pointing directly at the source and build trees.\n\nThe missing \u003cset\u003e include was causing compilation failures in certain toolchain configurations where the header was not transitively included.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nCI.\n\nCloses #3602 from jaystarshot/u-c.\n\nAuthored-by: jay.narale \u003cjay.narale@uber.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "13cd4a9232b7f771e2314943d8614598f7d62283",
      "tree": "2421822ef06a5e9a037d070468b61c7451931402",
      "parents": [
        "eb7a720ac9c60c14c9f2b7091888911429ab1c7b"
      ],
      "author": {
        "name": "Enrico Olivelli",
        "email": "eolivelli@gmail.com",
        "time": "Tue Feb 17 16:37:08 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Tue Feb 17 16:37:08 2026 +0800"
      },
      "message": "[CELEBORN-2265] Do not waste resources on hotpath for debug logging in HashBasedShuffleWriter and SortBasedShuffleWriter\n\n### What changes were proposed in this pull request?\nCompute logging message only when needed in order to save CPU cycles on the hotpath.\n\n### Why are the changes needed?\n\n\u003cimg width\u003d\"2561\" height\u003d\"687\" alt\u003d\"image\" src\u003d\"https://github.com/user-attachments/assets/53733848-ebcd-4079-a77a-9aa38ed1e90a\" /\u003e\n\n### Does this PR resolve a correctness bug?\n\nNo\n\n### Does this PR introduce _any_ user-facing change?\n\nNo\n\n### How was this patch tested?\n\nManual testing\n\nCloses #3603 from eolivelli/flushbuffer-logger.\n\nAuthored-by: Enrico Olivelli \u003ceolivelli@gmail.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "eb7a720ac9c60c14c9f2b7091888911429ab1c7b",
      "tree": "a511e48a3fbf41ab0981529e02c45c00977dce55",
      "parents": [
        "8e6f4d5f95f58238913bf6f5bc769e5508d64efe"
      ],
      "author": {
        "name": "Dzeri96",
        "email": "13813363+Dzeri96@users.noreply.github.com",
        "time": "Tue Feb 17 16:32:46 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Tue Feb 17 16:32:46 2026 +0800"
      },
      "message": "[CELEBORN-2259] The S3MultipartUploadHandler uses fs.s3a.aws.credentials.provider\n\n### What changes were proposed in this pull request?\n\nThe S3 Client in `S3MultipartUploadHandler` now uses the dynamic config `fs.s3a.aws.credentials.provider` in order to set its provider chain up.\n\n### Why are the changes needed?\n\nBefore this, it was only possible to use the hard-coded provider configuration.\n\n### Does this PR resolve a correctness bug?\n\nSort of.\n\n### Does this PR introduce _any_ user-facing change?\n\nYes, in the sense that `celeborn.hadoop.fs.s3a.aws.credentials.provider` will now work correctly in the MultiPartHandler.\n\n### How was this patch tested?\n\nUnit tests and a manual test.\n**Note**: I don\u0027t like having to change the class in order to make it testable, but I\u0027m planning to get rid of this whole logic in another PR, where we will use the same hadoop-created S3 client everywhere.\n\nCloses #3599 from Dzeri96/CELEBORN-2259-cherrypicked.\n\nLead-authored-by: Dzeri96 \u003c13813363+Dzeri96@users.noreply.github.com\u003e\nCo-authored-by: Filip Darmanovic \u003cdzeri96@proton.me\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "8e6f4d5f95f58238913bf6f5bc769e5508d64efe",
      "tree": "20aaf3ddcc8b21d89da75177577dbb1b232f047f",
      "parents": [
        "b1cbdabdf70230b7d2fff3ee0f7c44fa5a829f92"
      ],
      "author": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Tue Feb 17 16:03:13 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Tue Feb 17 16:03:13 2026 +0800"
      },
      "message": "[CELEBORN-2063] Parallelize the create partition writer in handleReserveSlots to speed up the reserveSlots RPC process time\n\n### What changes were proposed in this pull request?\n\nParallelize the create partition writer in `handleReserveSlots` to speed up the reserveSlots RPC process time。\n\n### Why are the changes needed?\n\nThe creation of partition writer in `handleReserveSlots` could use parallelize way to speed up the reserveSlots RPC process time.\n\n### Does this PR introduce _any_ user-facing change?\n\nIntroduce `celeborn.worker.writer.create.parallel.enabled`, `celeborn.worker.writer.create.parallel.threads` and `eleborn.worker.writer.create.parallel.timeout` to config parallelize the creation of file writer.\n\n### How was this patch tested?\n\nCI.\n\nCloses #3387 from SteNicholas/CELEBORN-2063.\n\nAuthored-by: SteNicholas \u003cprogramgeek@163.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "b1cbdabdf70230b7d2fff3ee0f7c44fa5a829f92",
      "tree": "56cc0669aaa7b5dd65765128d17b9507c723bff1",
      "parents": [
        "4ed9cbd2a6f2019b79f248e13cc99b14dc6e23e0"
      ],
      "author": {
        "name": "afterincomparableyum",
        "email": "afterincomparableyum",
        "time": "Mon Feb 16 14:59:18 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Mon Feb 16 14:59:18 2026 +0800"
      },
      "message": "[CELEBORN-2221][CIP-14] Support writing with compression in C++ client\n\nIntegrate existing compression infrastructure (LZ4 and ZSTD) into the C++ client write path. This enables compression during pushData operations, matching the functionality available in the Java client.\n\nChanges:\n- Add compression support to ShuffleClientImpl:\n  * Add shuffleCompressionEnabled_ flag and compressor_ member\n  * Initialize compressor from CelebornConf in constructor\n  * Compress data in pushData() when compression is enabled\n  * Use compressed size for batchBytesSize tracking\n\n- Configuration integration:\n  * Read compression codec from celeborn.client.shuffle.compression.codec\n  * Read ZSTD compression level from celeborn.client.shuffle.compression.zstd.level\n  * Default to NONE (compression disabled)\n\n- Retry/revive support:\n  * Retry path correctly uses pre-compressed body buffer\n  * No re-compression needed during retries\n\n- Testing:\n  * Add CompressorFactoryTest for factory pattern and config integration\n  * Add compression config tests to CelebornConfTest\n  * Test offset compression support for both LZ4 and ZSTD\n\n### How was this patch tested?\n\nUnit Tests, as well as compiling code\n\nCloses #3575 from afterincomparableyum/cpp-client/celeborn-2221.\n\nAuthored-by: afterincomparableyum \u003cafterincomparableyum\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "4ed9cbd2a6f2019b79f248e13cc99b14dc6e23e0",
      "tree": "93f16d1d9ba6007f6603c81e8f644ea2059acfc2",
      "parents": [
        "81d89f3ecac24fe22b29a2965a740c8bf93ce186"
      ],
      "author": {
        "name": "Enrico Olivelli",
        "email": "eolivelli@gmail.com",
        "time": "Sun Feb 15 23:16:29 2026 +0800"
      },
      "committer": {
        "name": "Shuang",
        "email": "lvshuang.xjs@alibaba-inc.com",
        "time": "Sun Feb 15 23:16:29 2026 +0800"
      },
      "message": "[CELEBORN-2263] Fix IndexOutOfBoundsException while reading from S3\n\n### What changes were proposed in this pull request?\n\nProperly pass the size of the array to the InputStream that feeds the flush.\n\n### Why are the changes needed?\n\nBecause without this change if the array is bigger than the buffer, then the inputstream returns garbage, resulting in corrupted data on S3.\n\n### Does this PR resolve a correctness bug?\n\nYes.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nNew unit test + Manual testing.\n\nCloses #3600 from eolivelli/CELEBORN-2263-apache.\n\nAuthored-by: Enrico Olivelli \u003ceolivelli@gmail.com\u003e\nSigned-off-by: Shuang \u003clvshuang.xjs@alibaba-inc.com\u003e\n"
    },
    {
      "commit": "81d89f3ecac24fe22b29a2965a740c8bf93ce186",
      "tree": "552647c3ab78d890ae3818e990e3d9c5cff39649",
      "parents": [
        "b659439edc10f6b4f1768c4810222cd8e909151a"
      ],
      "author": {
        "name": "Enrico Olivelli",
        "email": "eolivelli@gmail.com",
        "time": "Thu Feb 05 14:12:04 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Thu Feb 05 14:12:04 2026 +0800"
      },
      "message": "[CELEBORN-2256] Helm chart: add support for setting annotations on the service account (to support eks.amazonaws.com/role-arn)\n\n### What changes were proposed in this pull request?\nAdding support for setting \"annotations\" on the Celeborn Service Account.\n\nPatch originally contributed by Filip Darmanovic Dzeri96\n\n### Why are the changes needed?\nThis is needed to support AWS IAM roles in k8s EKS\n\n### Does this PR resolve a correctness bug?\n\nNo\n\n### Does this PR introduce _any_ user-facing change?\n\nYes, you can now configure the annotations.\n\n### How was this patch tested?\n\nManual testing + unit tests\n\nCloses #3595 from eolivelli/CELEBORN-2256.\n\nAuthored-by: Enrico Olivelli \u003ceolivelli@gmail.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "b659439edc10f6b4f1768c4810222cd8e909151a",
      "tree": "8d876d32f37c5ff7ba5d14ec61dea685356b9ae0",
      "parents": [
        "2097ad0a347b982e68ac07779ee1ab11815b524b"
      ],
      "author": {
        "name": "Enrico Olivelli",
        "email": "enrico@beast.io",
        "time": "Thu Feb 05 13:39:12 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Thu Feb 05 13:44:59 2026 +0800"
      },
      "message": "[CELEBORN-2254] Fix support for S3 and add a simple integration test\n\n### What changes were proposed in this pull request?\n\n* Fix creating files to S3 (and other DFS)\n* Add integration test for Spark and S3 (using Minio)\n* in CI some job will run with the AWS profile because this way we can activate the new integration test (that needs the S3 client dependencies)\n\n### Why are the changes needed?\n\nSee https://issues.apache.org/jira/browse/CELEBORN-2254.\n\n### Does this PR resolve a correctness bug?\n\nNo\n\n### Does this PR introduce _any_ user-facing change?\n\nNo\n\n### How was this patch tested?\n\n* I have added an integration test\n* I have this patch on out internal fork, to make Celeborn run on k8s with S3\n\nCloses #3592 from eolivelli/CELEBORN-2254-apache.\n\nAuthored-by: Enrico Olivelli \u003cenrico@beast.io\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "2097ad0a347b982e68ac07779ee1ab11815b524b",
      "tree": "d5defbf92becdc6041c54191129e7667db2f39e6",
      "parents": [
        "475293663c7519c4f4e4fee9a37d4fd1900003a1"
      ],
      "author": {
        "name": "xxx",
        "email": "953396112@qq.com",
        "time": "Wed Feb 04 19:38:06 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed Feb 04 19:38:06 2026 +0800"
      },
      "message": "[CELEBORN-2205] Introduce metrics to fetch chunk for memory and local disk\n\n### What changes were proposed in this pull request?\n\nIntroduce metrics to fetch chunk time for memory and local disk.\n\n### Why are the changes needed?\n\nIntroduce metrics to fetch chunk time for memory and local disk.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\n[Grafana](https://xy2953396112.grafana.net/public-dashboards/979279524ef74b6b92d0c08c39aa7c9e)\n\nCloses #3546 from xy2953396112/CELEBORN-2205.\n\nAuthored-by: xxx \u003c953396112@qq.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "475293663c7519c4f4e4fee9a37d4fd1900003a1",
      "tree": "b1f0e96e70fbc01723b546d8b215d4455459e64d",
      "parents": [
        "46a49a8285b3a77f51b1a3223760cab7d8182667"
      ],
      "author": {
        "name": "Sanskar Modi",
        "email": "sanskarmodi97@gmail.com",
        "time": "Wed Feb 04 19:34:26 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Wed Feb 04 19:34:26 2026 +0800"
      },
      "message": "[CELEBORN-1577][FOLLOWUP] Fix master resource consumption metrics\n\n### What changes were proposed in this pull request?\n\nFix master resource consumption metrics. https://github.com/apache/celeborn/pull/2819/ introduced a bug in master resource consumption metrics where we passed a local variable as GaugeSupplier leading to static values for user resource consumption.\n\n### Why are the changes needed?\n\nCurrently the code is buggy and gives a static value\n\n### Does this PR resolve a correctness bug?\n\nNo\n\n### Does this PR introduce _any_ user-facing change?\n\nNo\n\n### How was this patch tested?\nGA in our cluster.\n\nCloses #3591 from s0nskar/CELEBORN-1577.\n\nAuthored-by: Sanskar Modi \u003csanskarmodi97@gmail.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "46a49a8285b3a77f51b1a3223760cab7d8182667",
      "tree": "9cb17a828013b27b24808c92a1b95721a6bd40ee",
      "parents": [
        "f04ddddc2d4a7f5311e4ef802cb49d84cb0eea95"
      ],
      "author": {
        "name": "yew1eb",
        "email": "yew1eb@gmail.com",
        "time": "Tue Jan 27 21:15:44 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Tue Jan 27 21:15:44 2026 +0800"
      },
      "message": "[CELEBORN-2250] Fix lock contention in ReducePartitionCommitHandler.finishMapperAttempt via fine-grained locks\n\n### What changes were proposed in this pull request?\nAdd `shuffleIdLocks` (fine-grained locks per shuffleId), replace global `shuffleMapperAttempts` lock in `initMapperAttempts` and `finishMapperAttempt`.\n\n### Why are the changes needed?\nHigh concurrency causes lock contention on `shuffleMapperAttempts` in `finishMapperAttempt`, leading to abnormally long shuffle write time for small queries in Kyuubi Shared Mode. Fine-grained locks eliminate cross-shuffle blocking and improve concurrency.\n\n### Does this PR resolve a correctness bug?\nNo.\n### Does this PR introduce _any_ user-facing change?\nNo.\n\n### How was this patch tested?\nCI.\n\nCloses #3586 from yew1eb/CELEBORN-2250.\n\nAuthored-by: yew1eb \u003cyew1eb@gmail.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    },
    {
      "commit": "f04ddddc2d4a7f5311e4ef802cb49d84cb0eea95",
      "tree": "b3a544f48fbb574816357662bd5f255a9ddc98c7",
      "parents": [
        "e81cea069644fd9f35f953666a174e1b6a9763a0"
      ],
      "author": {
        "name": "luogen.lg",
        "email": "luogen.lg@alibaba-inc.com",
        "time": "Thu Jan 22 10:35:44 2026 +0800"
      },
      "committer": {
        "name": "SteNicholas",
        "email": "programgeek@163.com",
        "time": "Thu Jan 22 10:35:44 2026 +0800"
      },
      "message": "[CELEBORN-2251] Introducing a shim layer and a common-tiered submodule for Flink clients\n\n### What changes were proposed in this pull request?\n\n1. Introduce a client-flink-common-tiered submodule to host the shared tiered shuffle logic.\n2. Introduce a shim layer to further unify the implementations and make version-specific changes more explicit.\n3. Unify the tests as well, so that versioned clients can simply run the common test suite with their own shims.\n\n### Why are the changes needed?\n\nThough Celeborn now has a client-flink-common submodule, clients still have to copy a lot of code from version to version, with small but necessary changes buried in the duplicates. The tests don’t share a common implementation either. Even worse, with Flink introducing tiered shuffle from 1.20 onward, all tiered shuffle implementations must be placed inside the versioned clients to ensure that client-flink-common can be compiled against all Flink versions. This makes it much harder to evolve the tiered shuffle implementations.\nFor the maintenance of Flink clients in the future, we need better organization of the submodules, consolidating the implementations and tests.\n\n### Does this PR resolve a correctness bug?\n\nNo.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nThe patch consolidates all current Flink client tests, and can be tested with them.\n\nCloses #3589 from pltbkd/flink-shim-0.7.\n\nLead-authored-by: luogen.lg \u003cluogen.lg@alibaba-inc.com\u003e\nCo-authored-by: SteNicholas \u003cprogramgeek@163.com\u003e\nSigned-off-by: SteNicholas \u003cprogramgeek@163.com\u003e\n"
    }
  ],
  "next": "e81cea069644fd9f35f953666a174e1b6a9763a0"
}
