GH-1038: Trim object memory for ArrowBuf (#1044)

## What's Changed

A significant number of ArrowBuf and BufferLedger objects are created during certain workloads. Saving several bytes per instance could add up to significant memory savings and reduced memory allocation expense and garbage collection.

The id field, which was a sequential value used when logging object information, is replaced with an identity hash code. This should still allow enough information for debugging without the memory overhead. There may be possible duplicate values but it shouldn't matter for logging purposes.

Atomic fields can be replaced by a primitive and a static updater which saves several bytes per instance.

### ArrowBuf

| Component | Before | After | Savings |
|-----------|--------|-------|---------|
| `idGenerator` (static) | `AtomicLong` | Removed | 24 bytes globally |
| `id` field (per instance) | `long` (8 bytes) | Removed | **8 bytes per instance** |
| `getId()` | Returns `id` field | Returns `System.identityHashCode(this)` | n/a |

### BufferLedger

| Component | Before | After | Savings |
|-----------|--------|-------|---------|
| `LEDGER_ID_GENERATOR` (static) | `AtomicLong` | Removed | 24 bytes globally |
| `ledgerId` (per instance) | `long` (8 bytes) | Removed | **8 bytes per instance** |
| `bufRefCnt` | `AtomicInteger` (24 bytes) | `volatile int` + static updater | **20 bytes per instance** |

### Total Savings

| Scale | ArrowBuf | BufferLedger | Combined |
|-------|----------|--------------|----------|
| 100K | 800 KB | 2.8 MB | **~3.6 MB** |
| 1M | 8 MB | 28 MB | **~36 MB** |
| 10M | 80 MB | 280 MB | **~360 MB** |

### Benchmarking

I ran the added benchmark before and after the metadata trimming.

**Metadata Trimmed**

| Benchmark | Mode | Score | Error | Units |
|-----------|------|-------|-------|-------|
| MemoryFootprintBenchmarks.measureAllocationPerformance | avgt | 456.831 | ± 36.059 | us/op |
| MemoryFootprintBenchmarks.measureArrowBufMemoryFootprint | ss | 161.085 | ± 35.596 | ms/op |
| Created 100000 ArrowBuf instances. Heap memory used | sum | 35631520 bytes (33.98 MB) | 0 | bytes |
| Average memory per ArrowBuf | sum | 356.32 bytes | 0 | bytes |

**Previous Object Layout**

| Benchmark | Mode | Score | Error | Units |
|-----------|------|-------|-------|-------|
| MemoryFootprintBenchmarks.measureAllocationPerformance | avgt | 466.171 | ± 16.233 | us/op |
| MemoryFootprintBenchmarks.measureArrowBufMemoryFootprint | ss | 176.790 | ± 17.943 | ms/op |
| Created 100000 ArrowBuf instances. Heap memory used | sum | 38817480 bytes (37.02 MB) | 0 | bytes |
| Average memory per ArrowBuf | sum | 388.17 bytes | 0 | bytes |

Closes #1038.
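The two techniques described in this change (a `volatile int` with a shared static updater in place of a per-instance `AtomicInteger`, and `System.identityHashCode` in place of a stored sequential id) can be illustrated with a minimal sketch. The class `LedgerSketch` and its method names are hypothetical and are not taken from the Arrow code base; only the field name `bufRefCnt` mirrors the table above.

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Hypothetical stand-in for a reference-counted ledger; not Arrow's BufferLedger.
class LedgerSketch {
  // One updater shared by all instances, instead of one AtomicInteger object per instance.
  private static final AtomicIntegerFieldUpdater<LedgerSketch> BUF_REF_CNT_UPDATER =
      AtomicIntegerFieldUpdater.newUpdater(LedgerSketch.class, "bufRefCnt");

  // The updater requires the target field to be a volatile int.
  private volatile int bufRefCnt = 0;

  // Atomically increments the count and returns the new value.
  int retain() {
    return BUF_REF_CNT_UPDATER.incrementAndGet(this);
  }

  // Atomically decrements the count and returns the new value.
  int release() {
    return BUF_REF_CNT_UPDATER.decrementAndGet(this);
  }

  int getRefCount() {
    return bufRefCnt;
  }

  // No stored id field: the identity hash code is used for logging only.
  // Values are not guaranteed to be unique across instances.
  long getId() {
    return System.identityHashCode(this);
  }

  public static void main(String[] args) {
    LedgerSketch ledger = new LedgerSketch();
    ledger.retain();
    ledger.retain();
    ledger.release();
    System.out.println("id=" + ledger.getId() + " refCount=" + ledger.getRefCount());
  }
}
```

The updater performs the same atomic operations as `AtomicInteger`, but the counter lives inline in the owning object, so no separate wrapper object (with its own header) is allocated per instance.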
The following guides explain the fundamental data structures used in the Java implementation of Apache Arrow.
Generated javadoc documentation is available here.
Refer to Building Apache Arrow for documentation of environment setup and build instructions.
Arrow uses Google's Flatbuffers to transport metadata. The Java version of the library requires that the generated Flatbuffers classes be used only with the same Flatbuffers version that generated them. Arrow packages a version of the arrow-vector module that shades flatbuffers and arrow-format into a single JAR. Using the classifier “shade-format-flatbuffers” in your pom.xml will make use of this JAR; you can then exclude or resolve the original dependency to a version of your choosing.
$ flatc --version
flatc version 25.1.24
$ grep "dep.fbs.version" pom.xml
<dep.fbs.version>25.1.24</dep.fbs.version>

cd $ARROW_HOME
# remove the existing files
rm -rf format/src
# regenerate from the .fbs files
flatc --java -o format/src/main/java arrow-format/*.fbs
# prepend license header
mvn spotless:apply -pl :arrow-format
There are several system properties and environment variables that users can configure. These trade off safety (they turn off checking) for speed. Typically they are only used in production settings after the code has been thoroughly tested without using them.
Bounds checking for memory accesses: Bounds checking is on by default. You can disable it by setting either the system property (arrow.enable_unsafe_memory_access) or the environment variable (ARROW_ENABLE_UNSAFE_MEMORY_ACCESS) to true. When both the system property and the environment variable are set, the system property takes precedence.
Null checking for gets: ValueVector get methods (not getObject methods) verify by default that the slot is not null. You can disable it by setting either the system property (arrow.enable_null_check_for_get) or the environment variable (ARROW_ENABLE_NULL_CHECK_FOR_GET) to false. When both the system property and the environment variable are set, the system property takes precedence.
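The precedence rule described above (system property first, then environment variable, then the default) can be sketched as a small lookup helper. This is an illustration only: `ArrowFlagSketch` and `resolveFlag` are hypothetical names, the use of `Boolean.parseBoolean` is an assumption, and Arrow's own parsing of these values may differ. Only the property and variable names come from the text.

```java
import java.util.Map;
import java.util.Properties;

// Hypothetical helper illustrating the documented precedence; not Arrow's implementation.
class ArrowFlagSketch {
  // Returns the flag value: system property if set, else environment variable, else the default.
  static boolean resolveFlag(
      Properties props,
      Map<String, String> env,
      String propertyName,
      String envName,
      boolean defaultValue) {
    String value = props.getProperty(propertyName);
    if (value == null) {
      // Fall back to the environment only when the system property is absent.
      value = env.get(envName);
    }
    // Assumption: values are parsed as case-insensitive "true"/"false".
    return value == null ? defaultValue : Boolean.parseBoolean(value);
  }

  public static void main(String[] args) {
    // Unsafe memory access is off by default (bounds checking on).
    boolean unsafeAccess = resolveFlag(
        System.getProperties(), System.getenv(),
        "arrow.enable_unsafe_memory_access", "ARROW_ENABLE_UNSAFE_MEMORY_ACCESS", false);
    // Null checking for gets is on by default.
    boolean nullCheck = resolveFlag(
        System.getProperties(), System.getenv(),
        "arrow.enable_null_check_for_get", "ARROW_ENABLE_NULL_CHECK_FOR_GET", true);
    System.out.println("unsafe memory access: " + unsafeAccess);
    System.out.println("null check for get: " + nullCheck);
  }
}
```

Running this with, for example, `-Darrow.enable_unsafe_memory_access=true` on the JVM command line shows which value would win under the stated rule when the environment variable is also set.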
-Dio.netty.tryReflectionSetAccessible=true should be set. This fixes the error java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available, which is thrown by Netty.

To support duplicate fields in a StructVector, enable -Darrow.struct.conflict.policy=CONFLICT_APPEND. By default (CONFLICT_REPLACE), duplicate fields are ignored and overwritten. To support different policies for conflicting or duplicate fields, set this JVM flag or use the appropriate static constructor methods for StructVectors.

Arrow Java follows the Google Java Style Guide with the following differences:

The NoFinalizer, OverloadMethodsDeclarationOrder, and VariableDeclarationUsageDistance rules are not enforced due to the existing code base. These rules should be followed when possible.

Refer to checkstyle.xml for rule specifics.
When running tests, Arrow Java uses the Logback logger with SLF4J. By default, it uses the logback.xml present in the corresponding module's src/test/resources directory, which has the default log level set to INFO. Arrow Java can be built with an alternate logback configuration file using the following command run in the project root directory:
mvn -Dlogback.configurationFile=file:<path-of-logback-file>
See Logback Configuration for more details.
Integration tests which require more time or more memory can be run by activating the integration-tests profile. This activates the Maven Failsafe plugin and any class prefixed with IT will be run during the testing phase. The integration tests currently require a larger amount of memory (>4GB) and time to complete. To activate the profile:
mvn -Pintegration-tests <rest of mvn arguments>