This post gives insight into the major improvements in the Java implementation of vectors. We undertook this work over the last 10 weeks since the last Arrow release.
We use templates in several places for compile time Java code generation for different vector classes, readers, writers etc. Templates are helpful as the developers don't have to write a lot of duplicate code.
However, we realized that over a period of time some specific Java templates became extremely complex with giant if-else blocks, poor code indentation and documentation. All this impacted the ability to easily extend these templates for adding new functionality or improving the existing infrastructure.
So we evaluated the usage of templates for compile time code generation and decided not to use complex templates in some places by writing small amount of duplicate code which is elegant, well documented and extensible.
We did extensive memory analysis downstream in Dremio where Arrow is used heavily for in-memory query execution on columnar data. The general conclusion was that Arrow's Java vector classes have non-negligible heap overhead and volume of objects was too high. There were places in code where we were creating objects unnecessarily and using structures that could be substituted with better alternatives.
Java vectors used delegation and abstraction heavily throughout the object hierarchy. The performance critical get/set methods of vectors went through a chain of function calls back and forth between different objects before doing meaningful work. We also evaluated the usage of branches in vector APIs and reimplemented some of them by avoiding branches completely.
We took inspiration from how the Java memory code in ArrowBuf
works. For all the performance critical methods, ArrowBuf
bypasses all the netty object hierarchy, grabs the target virtual address and directly interacts with the memory.
There were cases where branches could be avoided all together.
In case of nullable vectors, we were doing multiple checks to confirm if the value at a given position in the vector is null or not.