PIP-443: Stop using Netty Recycler in new code

Background knowledge

Netty Recycler is widely used in many places to reduce GC pressure by reusing objects. From the output of the following command:

find . -name "*.java" | grep src/main | xargs grep -n "new .*Recycler"

You can find:

  • 5 usages in pulsar-common
  • 10 usages in pulsar-broker
  • 10 usages in pulsar-client

However, many of them are only the result of cargo cult programming, which blindly believes that objects allocated from the Recycler are better than objects allocated from heap memory. Hence, we should first understand how it works.

Recycler is basically a thread-local object pool that provides

  • a get method that allocates an object from the pool, or calls the user-provided newObject method to create a new one if no pooled object is available.
  • a recycle method to return an object back to the pool.

Here is a typical implementation that wraps a Recycler to manage a pair of objects:

import io.netty.util.Recycler;
import lombok.Data;

@Data // generates getters (and setters for the non-final fields)
public class RecyclablePair {

    private final Recycler.Handle<RecyclablePair> handle;
    private Object x;
    private Object y;

    private RecyclablePair(Recycler.Handle<RecyclablePair> handle) {
        this.handle = handle;
    }

    public void recycle() {
        this.x = null;
        this.y = null;
        handle.recycle(this);
    }

    public static RecyclablePair create(Object x, Object y) {
        final RecyclablePair s = RECYCLER.get();
        s.x = x;
        s.y = y;
        return s;
    }

    private static final Recycler<RecyclablePair> RECYCLER = new Recycler<RecyclablePair>() {
        @Override
        protected RecyclablePair newObject(Handle<RecyclablePair> handle) {
            return new RecyclablePair(handle);
        }
    };
}

It's typically used with a “recycle-after-use” pattern like:

var pair = RecyclablePair.create(a, b);
process(pair);
pair.recycle();

or

var pair = RecyclablePair.create(a, b);
executor.execute(() -> {
    process(pair);
    pair.recycle();
});

Unlike a simple thread-local pool, recycling from a different thread is supported: in that case, recycle() pushes the object (pair in the code above) into a queue that implements MessagePassingQueue.

The following snippet of the thread-local pool used by Recycler shows some key parts:

private static final class LocalPool<T> implements MessagePassingQueue.Consumer<DefaultHandle<T>> {
    private final ArrayDeque<DefaultHandle<T>> batch;
    private volatile Thread owner;
    private volatile MessagePassingQueue<DefaultHandle<T>> pooledHandles;
    // ... (other fields and methods omitted)
}

When get is called, it first gets the thread-local LocalPool and applies the following logic:

  1. Drain some handles from pooledHandles to batch if batch is empty.
  2. Poll a handle from batch and mark it as “claimed” (used by the current thread).
  3. If a handle is available, call its get method to return the pooled object. In the example above, RecyclablePair#create calls Recycler#get to obtain a pooled object this way and then initializes some fields.
  4. Otherwise, call newObject to create a new object from heap memory and associate it with a new handle. With the default config, the handle is a “no-op handle” in 7/8 of cases, which won't push the object into pooledHandles when recycle is called. In the remaining 1/8 of cases, the handle is a DefaultHandle that is associated with the LocalPool of this thread and can be pushed to pooledHandles and batch.

When recycle is called and the handle is a DefaultHandle:

  1. Good Case: if the object was allocated by get() in a FastThreadLocalThread and recycle() is called in the same thread, the handle is pushed to batch directly.
  2. Bad Case: otherwise, the handle is pushed to the pooledHandles queue, which is less efficient.
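
To make the pooling behavior concrete, here is a hedged sketch that reuses the RecyclablePair class above. It must run in a FastThreadLocalThread for the Good Case to apply, and whether the first recycled instance is actually pooled depends on the Netty version and the default 1-in-8 pooling ratio, so the identity check below is illustrative rather than guaranteed:

// Illustrative sketch only; reuse depends on the pooling ratio, so the
// printed result is not guaranteed.
new io.netty.util.concurrent.FastThreadLocalThread(() -> {
    RecyclablePair first = RecyclablePair.create("a", "b");
    first.recycle(); // Good Case: recycled in the allocating FastThreadLocalThread
    RecyclablePair second = RecyclablePair.create("c", "d");
    // If "first" was actually pooled, "second" is the very same instance.
    System.out.println(first == second);
}).start();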

Motivation

There are mainly three issues of using Recycler in Pulsar.

Issue 1: performance overhead

As explained in the previous section, if a recyclable object like RecyclablePair is not allocated in a FastThreadLocalThread, it's not efficient. Here is an existing benchmark result from a GitHub Actions Ubuntu 24.04 runner:

recycler.RecyclerBenchmark.testRecord                             thrpt   25  468034078.305 ± 1453202.080  ops/s
recycler.RecyclerBenchmark.testThreadLocalSingletonRecycler       thrpt   25  137434124.938 ±  173674.317  ops/s
recycler.RecyclerBenchmark.testThreadLocalQueueRecycler           thrpt   25   70585776.893 ±  418944.351  ops/s
recycler.RecyclerBenchmark.testRecycler                           thrpt   25   36061222.443 ±  368182.450  ops/s
  • testRecord allocates a simple tuple record object from heap memory
  • testThreadLocalSingletonRecycler reuses a single tuple object stored in a thread local
  • testThreadLocalQueueRecycler allocates a recyclable tuple object from a thread-local queue (ArrayDeque) and adds it back to the queue (it simulates the case where the object is allocated in a FastThreadLocalThread and recycled in the same thread)
  • testRecycler allocates a recyclable tuple object using Recycler and then calls recycle

The full benchmark code is here.
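
A minimal sketch of what the two extreme cases might look like (this is not the actual benchmark code linked above; it assumes the RecyclablePair class from the previous section and an illustrative record counterpart):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

// Illustrative JMH sketch, not the linked benchmark code.
@State(Scope.Thread)
public class RecyclerSketchBenchmark {

    private record PairRecord(Object x, Object y) { }

    private final Object x = new Object();
    private final Object y = new Object();

    @Benchmark
    public Object testRecord() {
        return new PairRecord(x, y); // plain heap allocation
    }

    @Benchmark
    public void testRecycler(Blackhole bh) {
        RecyclablePair pair = RecyclablePair.create(x, y); // pooled allocation
        bh.consume(pair);
        pair.recycle(); // return it to the pool immediately
    }
}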

From the results above, Recycler allocation is about 7x slower than plain heap allocation in the good case and 13x slower in the bad case.

Issue 2: error-prone usage

Compared to the performance overhead, the error-prone usage is a more serious issue. Users have to call recycle() explicitly for each object, just like the error-prone malloc/free pattern in the C language.

var pair = RecyclablePair.create(a, b);
process(pair);
pair.recycle();

For example, if process in the code above throws an exception, recycle() will be skipped. #13233 is a real-world fix for such an issue. This particular issue is not serious, though, because the worst case is that the object is never returned to the pool and later allocations fall back to heap memory.
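
A defensive pattern (a sketch; not necessarily what the linked fix does) is to recycle in a finally block:

var pair = RecyclablePair.create(a, b);
try {
    process(pair);
} finally {
    pair.recycle(); // runs even when process() throws
}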

A more serious issue is when the object is shared among different threads and recycle() is called twice, which throws an exception on the 2nd recycle call. #24697 suffers from this issue and was not fixed when this proposal was written. Since the exception is thrown by recycle() on the read path, it affects the client side: the consumer cannot receive messages anymore. #24725 fixes an issue where a recyclable object was unexpectedly shared between two threads; if the object had not been recyclable, the worst case would have been that the same batched message is acknowledged twice.

The most serious issue is that if the object is shared by two threads and one thread has recycled it, the second thread could access:

  • the object whose fields are reset with initial values by recycle()
  • or the object that has been reused in another task by get()

The 1st case might lead to an NPE, which is relatively fine because it's exposed in logs. The 2nd case is really dangerous because the object is in an inconsistent state and the problem might not be exposed by any exception.
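
A hedged illustration of the 2nd case, in the same fragment style as the examples above (the interleaving shown is one possible schedule):

// Both threads hold a reference to the same RecyclablePair instance.
// Thread A:
pair.recycle();                           // fields reset, handle returned to the pool
var reused = RecyclablePair.create(c, d); // may hand back the very same instance
// Thread B, running concurrently:
process(pair);                            // may silently observe c/d instead of a/b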

Since the ownership of a recyclable object must be unique, move semantics can be implemented like:

@RequiredArgsConstructor
@Getter
static class MovableRecyclableObject {

    private static final Recycler<MovableRecyclableObject> RECYCLER = new Recycler<MovableRecyclableObject>() {
        @Override
        protected MovableRecyclableObject newObject(Handle<MovableRecyclableObject> handle) {
            return new MovableRecyclableObject(handle);
        }
    };

    private final Recycler.Handle<MovableRecyclableObject> handle;
    private Object object;

    public static MovableRecyclableObject create(Object object) {
        final var result = RECYCLER.get();
        result.object = object;
        return result;
    }

    public MovableRecyclableObject move() {
        final var result = MovableRecyclableObject.create(this.object);
        this.recycle();
        return result;
    }

    public void recycle() {
        this.object = null;
        handle.recycle(this);
    }
}

Before publishing the recyclable object to a different thread, move() must be called, so that if the object is accessed by a third thread later, it fails fast with an NPE. However, this makes the code harder to write because developers have to follow this pattern: forgetting to call move() causes neither a compile-time error nor an immediate runtime error.
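
For example (a sketch; executor and the use consumer are assumed from context):

var obj = MovableRecyclableObject.create(payload);
var moved = obj.move(); // "obj" is recycled here and must no longer be used
executor.execute(() -> {
    use(moved.getObject());
    moved.recycle();
});
// Any later access through "obj" sees a null object and fails fast with an NPE.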

Without following the move-before-publish pattern, we would have to make recycle synchronized and idempotent, which has more performance overhead:

    public synchronized void recycle() {
        if (this.object != null) {
            this.object = null;
            handle.recycle(this);
        }
    }

Issue 3: questionable benefits

How much GC pressure can be reduced by sacrificing memory allocation latency is hard to measure and has never been proved by any benchmark. When Recycler was introduced in Netty, Java 8 had not even been released. The main use case of Recycler in Netty is ByteBuf, to avoid the time-consuming buffer allocation from direct memory, where the buffer is used as a request for users to process in a FastThreadLocalThread.

However, Java GC has evolved a lot over these years. For example, Generational ZGC, introduced in Java 21, performs much better on tail latencies than the previous garbage collectors. Here are some blogs that verify the improvement:

It's questionable whether Recycler is still useful on modern Java versions. Even assuming that a particular usage of Recycler can reduce GC pressure significantly in some scenario, the benefit can easily be broken by a slight change of the code, which is hard to notice and verify.

In Pulsar, Netty Recycler is typically used to wrap a group of objects to avoid passing them as method parameters, which would make methods too long with duplicated parameters. For example, ManagedLedgerImpl#asyncReadOpEntry accepts an OpReadEntry instance, which is recyclable and includes many fields. This method is called for each read operation, which happens frequently, so technically it can save heap memory allocations.

However, the code has become more complicated: after being passed to the entry cache for processing, the OpReadEntry is wrapped in another callback object: https://github.com/apache/pulsar/blob/d833b8bef21cb9e85f0f313eb9d49c7ca550fbbd/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/cache/RangeEntryCacheImpl.java#L349-L365

So where the recycler design on OpReadEntry used to save 1 of 1 heap memory allocations for each read operation, it now saves only 1 of 2. What's more, this wrapped callback is passed to PendingReadsManager and is eventually stored as a field of a new record type, ReadEntriesCallbackWithContext, which is added to an ArrayList in https://github.com/apache/pulsar/blob/d833b8bef21cb9e85f0f313eb9d49c7ca550fbbd/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/cache/PendingReadsManager.java#L256

Now, each read operation has:

  • 1 object allocated by Netty recycler (OpReadEntry)
  • 1 object allocated from heap memory (the wrapped callback in RangeEntryCacheImpl)
  • 1 object allocated from heap memory (the ReadEntriesCallbackWithContext object that has a field of the wrapped callback)
    • This object is added to an ArrayList allocated from heap memory with at least 1 element

Eventually, saving only the 1st heap memory allocation should not make a difference. Moreover, JVM optimizations such as escape analysis with scalar replacement can already keep short-lived objects from being allocated on the heap at all.

Goals

Make a rule for whether to use Netty Recycler, to avoid debates during code review like “the recycler is good, so you should not remove it”.

High Level Design

For new contributions, don't use Netty Recycler.

If you need to save heap memory allocations, consider using FastThreadLocal to cache a single object or an object queue, like what Netty Recycler does for the Good Case mentioned in the Background knowledge section; see the sketch below. This would also be more effective for normal threads than Netty Recycler.
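
Here is a minimal sketch of the suggested alternative, assuming a FastThreadLocal-backed queue (the class names and the capacity limit are illustrative, not an existing Pulsar API):

import io.netty.util.concurrent.FastThreadLocal;
import java.util.ArrayDeque;

// Illustrative thread-local object queue, similar to what Recycler
// does in the Good Case, but without cross-thread recycling.
public final class PairPool {

    private static final int MAX_POOLED = 1024; // illustrative capacity limit

    private static final FastThreadLocal<ArrayDeque<MutablePair>> POOL =
            new FastThreadLocal<>() {
                @Override
                protected ArrayDeque<MutablePair> initialValue() {
                    return new ArrayDeque<>();
                }
            };

    public static MutablePair acquire(Object x, Object y) {
        MutablePair pair = POOL.get().pollFirst();
        if (pair == null) {
            pair = new MutablePair(); // pool is empty: fall back to heap allocation
        }
        pair.x = x;
        pair.y = y;
        return pair;
    }

    // Must be called from the same thread that called acquire()
    public static void release(MutablePair pair) {
        pair.x = null;
        pair.y = null;
        final ArrayDeque<MutablePair> queue = POOL.get();
        if (queue.size() < MAX_POOLED) {
            queue.offerFirst(pair);
        } // otherwise drop the object and let GC reclaim it
    }

    public static final class MutablePair {
        Object x;
        Object y;
    }

    private PairPool() { }
}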

For existing code, we generally should not spend effort removing recyclable classes. But it's acceptable to remove them, especially when:

  • A known recycler issue is reported but the reason is hard to analyze. (example: #24741)
  • The existing recyclable object makes code refactoring hard

Alternatives

Add an option to toggle whether to use Netty Recycler

This would be helpful to figure out whether the GC pressure can be reduced once the option is enabled. However, the code would become much more complicated. Additionally, measuring the GC pressure is not easy. In many dashboards, the GC time is mistakenly called “GC pause time”, even though there is no actual pause. When objects aren't recycled, it's expected for some use cases that more CPU will be spent in GC. However, the Netty Recycler consumes a significant share of CPU too, and there's no metric for it, so an increase in GC metrics alone wouldn't show whether getting rid of Netty Recycler is useful.

Links