blob: af05622f592cab72fc02205b8897d2677c7972c5 [file] [view]
# Coding guidelines
Apache Pulsar follows the Sun Java Coding Conventions with additional project-specific rules. The
codebase is performance-critical, asynchronous, and concurrency-sensitive, so code review prioritizes
**correctness, thread safety, performance, maintainability, and backward compatibility**. This file is
the canonical coding reference for human contributors and AI coding agents; see [`AGENTS.md`](AGENTS.md)
for the agent-specific guardrails on top of it.
## Style
- **4 spaces** for indentation; **tabs must never be used**.
- Always use **curly braces**, even for single-line `if` statements.
- No `@author` tags in Javadoc.
- Every `TODO` must reference a GitHub issue, e.g. `// TODO: https://github.com/apache/pulsar/issues/XXXX`.
- Checkstyle config: `buildtools/src/main/resources/pulsar/checkstyle.xml`. Lombok is enabled.
- Prefer imports over fully qualified class names in code. Use a fully qualified class name only when
needed to disambiguate a name collision that imports cannot resolve.
## Logging
- Prefer **[slog](https://github.com/merlimat/slog)** (`io.github.merlimat.slog`) via Lombok's
**`@CustomLog`** (wired in `lombok.config` to `Logger.get(TYPE)`). **SLF4J is deprecated** for new
code; never use `System.out` / `System.err`.
- **Default new logs to `TRACE`/`DEBUG`, not `INFO`** — Pulsar overuses `INFO` and floods production
logs. Reserve `INFO` for low-frequency lifecycle/state-change events.
- Attach data as **structured attributes** — `log.info().attr("topic", topic).log("Published")` — not
interpolated into the message string.
- For expensive `DEBUG`/`TRACE` values, don't guard with `isDebugEnabled()`/`isTraceEnabled()`; use
slog's lazy form — `log.debug().attr("dump", () -> expensiveDump()).log("...")` or
`log.debug(e -> e.attr("dump", expensiveDump()).log("..."))`.
- Avoid logging on hot paths, and stack traces at `INFO` or lower.
- Use `DEBUG` in a way where it could be enabled in production without causing too many log entries. Use `TRACE` for more detailed information.
## Asynchronous programming
Pulsar relies heavily on `CompletableFuture`; prefer it over `ListenableFuture` for new code.
- **A method returning `CompletableFuture` must not throw synchronously.** Propagate failures through
the returned future — `return CompletableFuture.failedFuture(e);` — including for argument validation
(`if (arg == null) return CompletableFuture.failedFuture(new IllegalArgumentException("arg"));`).
Throwing *inside* a stage (`thenApply`, `thenCompose`, `handle`, `whenComplete`, …) is fine.
Avoid (escapes synchronously; a caller chaining `.exceptionally(...)` never sees it):
```java
CompletableFuture<T> process(String arg) {
if (arg == null) {
throw new IllegalArgumentException("arg");
}
return doProcessAsync(arg);
}
```
Prefer (report the validation failure through the returned future):
```java
CompletableFuture<T> process(String arg) {
if (arg == null) {
return CompletableFuture.failedFuture(new IllegalArgumentException("arg"));
}
return doProcessAsync(arg);
}
```
- **Never block on event-loop / async-execution threads** — no `Thread.sleep`, `Future.get()`,
`CompletableFuture.join()`, or blocking IO. An operation that performs IO should return a future.
- **Avoid nested futures** (`CompletableFuture<CompletableFuture<T>>`); flatten with `thenCompose`.
Prefer **`OrderedExecutor`** for ordered asynchronous work.
Avoid (`thenApply` on a future-returning function yields `CompletableFuture<CompletableFuture<R>>`):
```java
return firstAsync(arg).thenApply(v -> secondAsync(v));
```
Prefer (`thenCompose` flattens it to `CompletableFuture<R>`):
```java
return firstAsync(arg).thenCompose(v -> secondAsync(v));
```
- **Converting a synchronous-throwing method to a failed future is not mechanical** — some callers rely
on the throw happening *before* the async work starts, so evaluate each call site. Use a shared
`checkArgumentAsync` helper (in `FutureUtil`) to validate without duplicating try/catch.
- **Limit concurrency and handle backpressure.** Firing many async operations at once can overwhelm the
system. Options:
- **`com.spotify.futures.ConcurrencyReducer`** — caps in-flight futures at a configurable limit (used
in the Admin client to bound concurrent requests per broker).
- **`org.apache.pulsar.common.util.FutureUtil.Sequencer`** — runs async operations sequentially.
- **`org.apache.pulsar.common.semaphore.AsyncSemaphoreImpl`** — a non-blocking semaphore with a
per-operation cost that queues callers instead of failing when the limit is reached. Preferred over
`ConcurrencyReducer` for request-driven cases that need a timeout on permit acquisition.
## Testing conventions
Most Pulsar **"unit tests"** (`src/test`, run with `./gradlew :<module>:test`) are actually
**integration-style** — they start a real in-JVM broker (`MockedPulsarServiceBaseTest` /
`pulsarTestContext`) rather than testing a class in isolation. The **container integration tests**
under `tests/` run against a Pulsar Docker image (see
[`CONTRIBUTING.md`](CONTRIBUTING.md#integration-tests)). Ideally code is factored so genuine units *can*
be unit-tested in isolation with light mocking — excessive mocking is a design smell, not the goal —
but much existing code isn't, so integration-style is the pragmatic default. See
[`CONTRIBUTING.md`](CONTRIBUTING.md) for how to *run* tests (groups, `--tests` scoping, retry count).
- **TestNG + Mockito.** Prefer **AssertJ** assertions (with descriptions) over TestNG asserts; use
**Awaitility** for async conditions instead of `sleep` timing, with timeouts to prevent hangs.
`untilAsserted(...)` retries assertions, `until(...)` waits for a boolean — don't swap them. Verify
async interactions with Mockito `timeout(...)`, not fixed sleeps.
- Every feature or bug fix needs **deterministic** tests for edge and failure cases. A bug-fix test
must **fail on the unpatched code for the real reason** — not because it forces internal state.
- For code not factored for isolation, prefer an integration-style test over mocking a web of
collaborators: inject faults via the test infrastructure (e.g.
`pulsarTestContext.getMockBookKeeper().setReadHandleInterceptor(...)`) and assert on logs with
`TestLogAppender`. It's fine to add a **clean new test class** rather than extend an awkward one.
- **No reflection into private state** (`WhiteboxImpl.getInternalState`/`setInternalState`,
`setAccessible(true)`). Expose a **package-private `@VisibleForTesting`** accessor and put the test in
the same package; flag new reflection in review ([dev@ rationale](https://lists.apache.org/thread/7gr04sqmzyttx4ln6ydtp3qv0xgo1o6m)).
- **New integration-style tests: extend `SharedPulsarBaseTest`.** It shares one `SharedPulsarCluster`
for the test-JVM lifecycle (`admin` / `pulsarClient` are per test class); each method gets its own
namespace. Use `getNamespace()` and `newTopicName()` — never hardcode namespace/topic names, since
the runtime is shared.
- **Close/release what the test allocates.** A **`ByteBuf`/buffer leak** (pooled-allocator detection,
`-Dpulsar.allocator.pooled=true`) is a **real bug** — fix the missing `release()`. A **thread leak
from `ThreadLeakDetectorListener` is unreliable** (high false-positive rate, notably with
`SharedPulsarBaseTest` and when `THREAD_LEAK_DETECTOR_WAIT_MILLIS` is too low — ≈`10000` recommended,
only effective with the Gradle daemon disabled, `--no-daemon`); corroborate before treating it as
real.
- **Validate performance optimizations with a JMH benchmark** under `microbench/`, simulating a
realistic production usage pattern (see `microbench/README.md`).
## General recommendations
- **Use the narrowest interface type** for fields, parameters, variables, and returns (`Map`,
`SequencedMap`, `SortedMap`, `Collection`, `List`) rather than a concrete type like `TreeMap`. Keep
the concrete type only where its behaviour is required (e.g. a `TreeMap` for key-ordered iteration),
still exposed through the interface.
- **Minimize method and constructor parameters.** For a constructor with many parameters,
use a **builder** — the project uses Lombok `@Builder` for most internal classes, and it works on a
`record` too. Consider refactoring by moving related methods to a separate class when it's a better fit.
- **Don't return generic tuples.** Instead of `org.apache.commons.lang3.tuple.Pair<L, R>` (or a similar
tuple type), define a small, purpose-named **Java `record`** inline in the class that declares the
method, with the **same visibility as the method** (`public`, package-private, or `private`).
Avoid (positional and untyped; call sites read `getLeft()` / `getRight()`):
```java
private Pair<Integer, Integer> minMax(Collection<Integer> values) { ... }
```
Prefer (a purpose-named record with the same visibility as the method):
```java
private record MinMax(int min, int max) {}
private MinMax minMax(Collection<Integer> values) { ... }
```
- **Prefer record keys over concatenated strings.** For a composite `Map` key, use a small `record`
instead of concatenating a `String` (e.g. `a + ":" + b`) — correct `equals`/`hashCode`, type-safe,
no delimiter/escaping bugs.
Avoid (delimiter collisions when a value contains `:`; no type safety):
```java
Map<String, V> map = new HashMap<>();
map.get(a + ":" + b);
```
Prefer (a small record key with correct `equals`/`hashCode`):
```java
record Key(String a, String b) {}
Map<Key, V> map = new HashMap<>();
map.get(new Key(a, b));
```
- **Don't use `@Builder` on public client-API classes** (harder to maintain backwards compatibility) — hand-write the builder.
- **Name methods for intent.** A method's name should reveal what it does. Query methods read like
queries (`shouldSkipChunk`, not `skipChunk`); methods that mutate state or perform an action are
named for that action. **Reserve the `get` prefix for pure queries** — using it for a method that
mutates state, or otherwise does more than return a value is strongly discouraged.
## Dependencies
Prefer existing dependencies over new libraries. Pulsar commonly uses Apache Commons / Guava
(utilities), **FastUtil** (type-specific collections), **JCTools** (concurrent structures),
**RoaringBitmap** (compressed bitsets), **Caffeine** (caching), **Jackson** (JSON), Prometheus /
**OpenTelemetry** (metrics), and **Netty** (networking and buffers).
A new dependency must be justified (why existing ones are insufficient) and must update the
bundled-dependency `LICENSE`/`NOTICE` — verify with `./gradlew checkBinaryLicense`.
## Backward compatibility
Pulsar maintains strong compatibility guarantees. Changes must not break public APIs, client
compatibility, wire-protocol compatibility, or serialized/metadata formats — servers must work with
both older and newer clients. Flag any change that may break compatibility.
**Plugin / SPI extension points are public API.** Many interfaces are selected by a `*ClassName`
configuration setting — e.g. `LoadManager`, `LedgerOffloaderFactory`, `AuthorizationProvider` /
`AuthenticationProvider`, `EntryFilter`, `TopicFactory`, `BrokerInterceptor`, dispatcher /
delayed-delivery-tracker factories, `CustomCommand` — and third parties ship implementations. Changing
such an interface, or a `protected` member of an extensible class (`PulsarWebResource`,
`PersistentTopic`, `Producer`), breaks them: it generally needs a PIP and must not land in
maintenance-branch backports.
**Design interface changes for backward compatibility.** When you add a method to such an interface,
prefer a `default` implementation that delegates to an existing method, so older third-party
implementations keep working unchanged. If no sensible delegation exists, add a separate
capability-query method (e.g. `boolean supportsX()`) the broker checks at runtime, so it can support
older implementations gracefully instead of depending on the new method.
**Don't leak third-party types through public/plugin interfaces.** Exposing Netty or AsyncHttpClient
classes breaks consumers of the **shaded** client (shaded vs. unshaded classes differ) and couples
callers to the implementation — provide a Pulsar-owned abstraction. Changing a documented behaviour or
guarantee (e.g. PIP-68 exclusive-producer guarantees, default rate-limiter behaviour) needs a PIP and a
dev@ discussion, not just a code change.
**Introduce changes behind a backward-compatible default.** Make new/changed behaviour opt-in via
configuration rather than silently changing existing deployments. Behaviour that risks data loss (e.g.
skipping unrecoverable data) must be gated behind an explicit flag (such as `autoSkipNonRecoverableData`),
defaulting to the safe/old behaviour.
## Resource and memory management
- Always close resources (streams, connections, executors, buffers) — prefer try-with-resources.
- On internal networking/messaging paths, prefer **Netty `ByteBuf`** over `ByteBuffer` unless an
external API requires it; release ref-counted buffers you allocate.
- **Don't hand-optimize allocation away.** Pulsar runs on **ZGC** (very low collection overhead), so
the extra short-lived allocations from favouring immutable objects (see *Concurrency* below) are
cheap. Older code pools objects with Netty's `Recycler`; this is **no longer recommended for new
code** — under ZGC the `Recycler` often *costs* more CPU than it saves. Don't add new `Recycler`
usage. See [PIP-443](pip/pip-443.md).
## Performance
- **Back optimizations with evidence** — a JMH benchmark (see *Testing conventions*) or a profile, not
intuition — measured on JIT-warmed code (see *Reproducing concurrency / memory-visibility bugs*).
- **On hot paths** (dispatch, IO, per-message): avoid `String.format` (build strings directly),
`Enum.values()` (match explicitly), and unnecessary allocation/locking; prefer lock-free or
single-writer designs.
- **Don't add overhead to an already-overloaded system.** Avoid doing work then discarding it (e.g.
reading entries only to drop them before dispatch) — extra work under load causes cascading failures;
acquire/estimate up front and reconcile afterwards.
- **Bound in-memory caches** (size or byte limit + eviction) and de-duplicate repeated `String`s
(cluster/tenant/namespace ids) with `org.apache.pulsar.common.util.StringInterner`.
## Configuration
When adding configuration options: use clear, descriptive names; provide sensible defaults; update the
default configuration files; and document the option.
## Code review checklist
When reviewing a PR, verify:
- Java coding conventions followed; logging follows the guidelines above (slog, levels, structured
attributes).
- Thread-safety risks; no blocking in async paths; correct `CompletableFuture` usage.
- No unnecessary dependencies; LICENSE/NOTICE updated when dependencies change.
- Backward compatibility preserved.
- Tests exist and are appropriate; reflection into private state is flagged with a `@VisibleForTesting`
accessor suggested instead.
- The **PR description explains the change** — at minimum **Motivation (why?)** and **Modifications
(what/how?)**, matching `.github/PULL_REQUEST_TEMPLATE.md`; a title alone isn't sufficient.
Focus feedback on correctness, reliability, and maintainability.
## Concurrency
- Public classes should be **thread-safe**; annotate non-thread-safe ones with `@NotThreadSafe`.
- Protect shared mutable state; prefer fine-grained synchronization; mutate on the intended thread.
Prefer the **single-writer principle** (a given piece of state mutated by only one thread) to avoid
concurrent mutation entirely.
- **Minimize work while holding a lock.** Capture needed state into locals inside the synchronized
block, then run callbacks, listeners, and IO *outside* it — never call out to listener/callback code
while holding a lock (this has fixed real deadlocks and contention).
- Give threads **meaningful names**. When creating thread pools, prefer Netty's
**`io.netty.util.concurrent.DefaultThreadFactory`** — it produces **`FastThreadLocalThread`**
instances (lower overhead `FastThreadLocal` lookups, which matter on Netty paths like the pooled
`ByteBuf` allocator) and assigns prefixed thread names.
Pulsar has no documented, project-wide concurrency model yet; see
[`ARCHITECTURE.md` → Concurrency model](ARCHITECTURE.md#concurrency-model-a-known-gap) for the
conventions that *should* govern threads, thread pools, and event loops.
### The Java Memory Model is what makes concurrent code correct
Several hard-to-investigate Pulsar bugs have come from misconceptions about Java synchronization:
- **A `synchronized` method or block is not, on its own, thread-safe.** It provides its
visibility/ordering guarantees only when the **same monitor/lock guards both the reads and the
writes** of the shared state.
- On 64-bit JVMs a field's value is **never corrupted** — a read returns some value that was actually
written. What breaks is **visibility**: without a happens-before relationship, threads can observe
different values, or never see an update. Establish happens-before with `synchronized`, `volatile`,
`final`, or `java.util.concurrent` constructs.
- **A field accessed by more than one thread needs explicit visibility** — make it `volatile` (or
guard every read *and* write with the same lock). `volatile` gives single-field visibility but does
**not** make compound updates (read-modify-write, check-then-act) atomic — use `java.util.concurrent`
atomics/locks for those.
- Visibility is per-field, so a mutable object can be observed **partially updated**.
- The only way to be reliably correct is to **conform to the Java Memory Model**. **Benign data races**
are sometimes acceptable, and some Pulsar code relies on this by design — but only as a deliberate,
documented choice.
- **Prefer immutable objects.** An object is **immutable** when all fields are `final` *and* every
nested instance is itself immutable (a `record` is the common case; immutability must hold for the
whole reachable graph). It is **effectively immutable** when never modified after construction but
with non-`final` fields. Publication differs: an **immutable** object benefits from the JMM's
final-field **safe initialization** (visible even when published via a data race) and needs **no**
safe publication; an **effectively immutable** one must be shared via **safe publication** (a `final`
or `volatile` field, or a `java.util.concurrent` construct such as `ConcurrentHashMap`). See
[Safe initialization](https://shipilev.net/blog/2014/safe-public-construction/#_safe_initialization).
### Reproducing concurrency / memory-visibility bugs
These bugs are timing- and platform-dependent and easily masked, so a clean run is weak evidence a fix
is correct:
- Interpreted and JIT-compiled code behave differently. Reproductions often need several **warm-up
rounds with a short pause** so the (tiered, asynchronous) JIT kicks in; a short test may never
trigger compilation. JVM flags can force earlier compilation, and the exercised paths affect what
gets compiled.
- Some races surface only on specific **hardware/OS** — classically **multi-socket / multi-NUMA**
machines, whose weaker cross-socket memory ordering exposes races a single socket never shows.