Netty Recycler
is widely used in many places to reduce GC pressure by reusing objects. From the outputs of the following command:
find . -name "*.java" | grep src/main | xargs grep -n "new .*Recycler"
You can find:
pulsar-common
pulsar-broker
pulsar-client
However, many of them are only the result of cargo cult programming, which blindly believes objects allocated from the Recycler
is better than allocated from heap memory. Hence, we should understand how it works first.
Recycler is basically a thread-local object pool that provides
get
method to allocate an object from the pool or call the user-provided newObject
method to create a new one if the pool is empty or full.recycle
method to return an object back to the pool.Here is a typical implementation that wraps a Recycler
to manage a triple:
@Data // support getter and setter for all fields public class RecyclablePair { private final Recycler.Handle<RecyclablePair> handle; private Object x; private Object y; private RecyclablePair(Handle<RecyclablePair> handle) { this.handle = handle; } public void recycle() { this.x = null; this.y = null; handle.recycle(this); } public static RecyclablePair create(Object x, Object y) { final RecyclablePair s = RECYCLER.get(); s.x = x; s.y = y; return s; } private static final Recycler<RecyclablePair> RECYCLER = new Recycler<RecyclablePair>() { @Override protected RecyclablePair newObject(Handle<RecyclablePair> handle) { return new RecyclablePair(handle); } }; }
It's typically used with a “recycle-after-use” pattern like:
var triple = RecyclablePair.create(a, b); process(triple); triple.recycle();
or
var triple = RecyclablePair.create(a, b); executor.execute(() -> { process(triple); triple.recycle(); });
Unlike a simple thread local pool, recycling in a different thread is supported because in this case, recycle()
will push the object (triple
in the code above) into a queue that implements MessagePassingQueue.
The following code snippet of the thread local pool used by Recycler
includes some key parts:
private static final class LocalPool<T> implements MessagePassingQueue.Consumer<DefaultHandle<T>> { private final ArrayDeque<DefaultHandle<T>> batch; private volatile Thread owner; private volatile MessagePassingQueue<DefaultHandle<T>> pooledHandles;
When get
is called, it firsts get the thread local LocalPool
and apply the following logic:
pooledHandles
to batch
if batch
is empty.batch
and mark it as “claimed” (used by the current thread).handle
is available, call the get
method to return the object. In the example above, RecyclablePair#create
calls this get
method to get a pooled object and initializes some fields.newObject
to create a new object from heap memory and associated with a new handle. With the default config, the handle is a “no-op handle” in 7/8 cases that won't push the object into pooledHandles
when recycle
is called. In the rest 1/8 cases, the handle type is a DefaultHandle
that is associated with the LocalPool
in this thread and can be pushed to pooledHandles
and batch
When recycle
is called if the handle is a DefaultHandle
:
get()
in a FastThreadLocalThread
and recycle()
is called in the same thread, the handle is pushed to batch
directly.pooledHandles
queue, which is less efficient.There are mainly three issues of using Recycler
in Pulsar.
As explained in the previous section, we can see that if the recyclable object like RecyclablePair
is allocated not in a FastThreadLocalThread
, it's not efficient. Here is an existing benchmark result running on a GitHub Actions Ubuntu 24.04 runner:
recycler.RecyclerBenchmark.testRecord thrpt 25 468034078.305 ± 1453202.080 ops/s recycler.RecyclerBenchmark.testThreadLocalSingletonRecycler thrpt 25 137434124.938 ± 173674.317 ops/s recycler.RecyclerBenchmark.testThreadLocalQueueRecycler thrpt 25 70585776.893 ± 418944.351 ops/s recycler.RecyclerBenchmark.testRecycler thrpt 25 36061222.443 ± 368182.450 ops/s
testRecord
allocates a simple tuple record
object from heap memorytestThreadLocalSingletonRecycler
allocates a tuple object from a thread local objecttestThreadLocalQueueRecycler
allocates a recyclable tuple object from a thread local queue (ArrayDeque
) and add it to the queue (it simulates the case that recycle
is called in a FastThreadLocalThread
and recycled in the same thread)testRecycler
allocates a recyclable tuple object using Recycler
and calling recycle
thenThe full benchmark code is here.
From the results above, the performance is about 7x worse than an object from heap memory in the good case and 13x worse in the bad case.
Compared to the performance overhead, the error-prone usage is a more serious issue. Users have to call recycle()
explicitly for each object like the error-prone malloc/free
in C language.
var triple = RecyclablePair.create(a, b); process(triple); triple.recycle();
For example, if process
in the code above throws an exception, recycle()
will be skipped. #13233 is a real world fix for such issue. Though the issue is not serious because the worst case is that the pool is full and then the object will be allocated from heap buffer.
A more serious issue is that the object is shared among different threads and then recycle()
are called twice, which will throw an exception for the 2nd recycle call. #24697 suffers this issue and was not fixed when this proposal is written. Since the exception is thrown by recycle()
during the read path, it affects the client side that the consumer cannot receive messages anymore. #24725 fixes an issue that a recyclable object is unexpected shared between two threads, while if the object is not recyclable, the worst case is that the same batched message is acknowledged twice.
The most serious issue is that if the object is shared by two threads and one thread has recycled it, the second thread could access:
recycle()
get()
The 1st case might lead to NPE, which might be relatively fine because it's exposed in logs. The 2nd case is really dangerous because the object is in an inconsistent state and might not be exposed with any exception.
Since the ownership of a recyclable object must be unique, the move semantics can be implemented like:
@RequiredArgsConstructor @Getter static class MovableRecyableObject { private static final Recycler<MovableRecyableObject> RECYCLER = new Recycler<MovableRecyableObject>() { @Override protected MovableRecyableObject newObject(Handle<MovableRecyableObject> handle) { return new MovableRecyableObject(handle); } }; private final Recycler.Handle<MovableRecyableObject> handle; private Object object; public static MovableRecyableObject create(Object object) { final var result = RECYCLER.get(); result.object = object; return result; } public MovableRecyableObject move() { final var result = MovableRecyableObject.create(this.object); this.recycle(); return result; } public void recycle() { this.object = null; handle.recycle(this); } }
Before publishing the recyclable object to a different thread, move()
must be called, so that if the object is accessed by a third thread later, it will fail fast with NPE. However, this makes code harder to write that developers have to follow this pattern. Not calling move()
won't lead to a compile-time issue or runtime issue.
Without following the move-before-publish pattern, we have to make recycle
synchronized, which will have more performance overhead:
public synchronized void recycle() { if (this.object != null) { this.object = null; handle.recycle(this); } }
How much GC pressure can be reduced by sacrificing the memory allocation latency is hard to measure and never proved by any benchmark. When Recycler
was introduced in Netty, Java 8 was even not released. The main use case of Recycler
in Netty is for ByteBuf
to avoid the time-consuming buffer memory allocation from direct memory, where the buffer is used as a request for users to process in the FastThreadLocalThread
.
However, Java GC has involved a lot in these years. For example, Generational ZGC, which is introduced since Java 21, performs much better on tail latencies than the previous garbage collectors. Here are some blogs that verify the improvement:
It's questionable if Recycler
is still useful in modern Java versions. Even if assuming a usage of Recycler
can reduce the GC pressure significantly in a scenario, the benefit can be easily broken by a slight change of the code, which is hard to be noticed and verified.
In Pulsar, Netty Recycler
is typically used to wrap a list of objects to avoid passing them as method parameters, which makes methods too long with duplicated parameters. For example, ManagedLedgerImpl#asyncReadOpEntry
accepts an OpReadEntry
instance, which is recycable and includes many fields. This method is called for each read operation, which should happen frequently, so technically it can save heap memory allocations.
However, as the code becomes complicated, after being passed to the entry cache to process, the OpReadEntry
is wrapped as another callback object: https://github.com/apache/pulsar/blob/d833b8bef21cb9e85f0f313eb9d49c7ca550fbbd/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/cache/RangeEntryCacheImpl.java#L349-L365
Now, if the recycler design on OpReadEntry
saved 1 of 1 heap memory allocation for each read operation, now it just saves 1 of 2 heap memory allocations. What's more, this wrapped callback is passed to PendingReadsManager
and eventually used as a field of a new record type ReadEntriesCallbackWithContext
to an ArrayList
in https://github.com/apache/pulsar/blob/d833b8bef21cb9e85f0f313eb9d49c7ca550fbbd/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/cache/PendingReadsManager.java#L256
Now, each read operation has:
OpReadEntry
)RangeEntryCacheImpl
)ReadEntriesCallbackWithContext
object that has a field of the wrapped callback)ArrayList
allocated from heap memory with at least 1 elementEventually, only saving the 1st heap memory allocation should not make a difference. Actually, many optimizations in garbage collectors can prevent short-lived objects from being allocated from heap memory.
Make a rule for whether to use Netty Recycler
to avoid debates when reviewing code like “the recycler is good that you should not remove it”.
For new contributions, don't use Netty Recycler
.
If you need to save heap memory usage, consider using FastThreadLocal
to cache a single object or an object queue like What Netty Recycler
does for the Good Case mentioned in the Background knowledge section. This would also be more effective for normal threads than Netty Recycler
.
For existing code, generally we should not pay efforts to remove recyclable classes. But it's acceptable to remove them especially when:
This would be helpful to figure out if the GC pressure can be reduced once it‘s enabled. However, the code would be much more complicated. Additionally, measuring the GC pressure is not easy. In many dashboards, the GC time is mistakenly called as “GC pause time”, where there is no actual pause. When objects aren’t recycled, it‘s expected for some use cases that more CPU will be spent in GC. However the Netty Recycler consumes a significant share of CPU too and there’s no metric, so just looking at GC metrics increasing wouldn't be a way how to see if getting rid of Netty Recyclers is useful.