docs/source/user_guide/arrow.rst - doris-thirdparty - Git at Google

 .. Copyright 2024-present Alibaba Inc.

 .. Licensed under the Apache License, Version 2.0 (the "License");
 .. you may not use this file except in compliance with the License.
 .. You may obtain a copy of the License at

 ..   http://www.apache.org/licenses/LICENSE-2.0

 .. Unless required by applicable law or agreed to in writing, software
 .. distributed under the License is distributed on an "AS IS" BASIS,
 .. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 .. See the License for the specific language governing permissions and
 .. limitations under the License.

 .. _memory-format:

 Memory Format
 =============

 `Paimon Java <https://paimon.apache.org/docs/master/>`_ uses a row-level
 abstraction, ``InternalRow``, as the default data format interface.
 Row-level interfaces are generally more intuitive for programming and can be
 easily integrated into data processing pipelines such as filtering and merge sorting.
 However, this approach introduces conversion and access overhead,
 making it difficult to fully leverage the additional performance benefits provided
 by modern CPU SIMD vectorization.

 Considering that the C++ implementation focuses more on end-to-end performance,
 and that underlying data file formats (e.g., ORC, Parquet) are predominantly
 columnar, providing a columnar-centric data abstraction can minimize overhead
 during data flow, integrate more naturally with vectorized execution engines,
 and ultimately deliver superior overall performance.

 Why Apache Arrow
 ----------------

 Apache Arrow is currently the most widely adopted in-memory columnar format and
 has strong native support for Parquet and ORC. It is well-integrated across
 open-source engines including Spark, Pandas, Drill, Impala, and Velox.

 Overall, using Apache Arrow as the in-memory format for Paimon C++ allows us to:

 - Maximize columnar performance and efficient data access patterns.
 - Seamlessly integrate with the Apache Arrow ecosystem and tooling.
 - Benefit from mature interoperability with popular data systems and formats.

 Versioning and Dependency Concerns
 ----------------------------------

 One important consideration is that Arrow is an active open-source project with
 a broad and evolving C++ surface area that spans data formats, compute kernels,
 and file I/O. Due to frequent releases and a large API surface, different Arrow
 C++ SDK versions can introduce API incompatibilities.

 If Paimon C++ directly depends on the full Arrow C++ SDK, it may conflict with
 existing Arrow C++ dependencies in other compute engines, raising integration
 costs and increasing long-term maintenance complexity.

 Adopting the Arrow C Data Interface
 -----------------------------------

 To leverage Arrow’s performance and ecosystem benefits while avoiding tight
 coupling to specific Arrow C++ SDK versions, we use the Arrow C Data
 Interface as the default in-memory format for Paimon C++.

 Key advantages:

 - Version-neutral: The C Data Interface is designed to be stable and forward-compatible
   across Arrow versions.
 - Compiler-neutral: C language interfaces avoid ABI friction commonly seen with
   C++ compilers and standard libraries.
 - Broad interoperability: The C Data Interface is supported by Arrow-based
   systems and enables zero-copy or minimal-copy interchange of columnar data.

 Design Principles
 -----------------

 - Columnar-first abstraction:
   Paimon C++ will represent in-memory data using columnar buffers and schemas
   compatible with the Arrow C Data Interface, minimizing transformation overhead.

 - Minimal dependency footprint:
   Prefer stable C interfaces and lightweight utility layers; avoid linking
   against the full Arrow C++ SDK unless strictly necessary and well-isolated.

 - Vectorization-aware execution:
   Structure data layouts to align with SIMD processing (e.g., contiguous
   buffers, clear null bitmaps and type-specific arrays), enabling efficient
   filtering, projection, and aggregation.

 - Interoperability:
   Ensure that data produced and consumed by Paimon C++ can be handed off to
   Arrow-compatible engines and libraries without expensive conversions.

 - Compatibility with columnar storage:
   Maintain efficient paths from columnar file formats (Parquet, ORC) to in-memory
   columnar representations, minimizing decoding and marshaling overhead.

 Implementation Outline
 ----------------------

 Schema and buffers
 ~~~~~~~~~~~~~~~~~~~
 - Represent schemas and arrays using Arrow C Data Interface types (e.g., ``ArrowSchema``, ``ArrowArray``) with clear ownership and lifecycle.
 - Support nested types (structs, lists, maps) and common primitives (integers, floats, decimals, timestamps).

 Memory management
 ~~~~~~~~~~~~~~~~~~~~
 - Define consistent ownership semantics for buffers and child arrays.
 - Employ reference counting or explicit release callbacks aligned with the Arrow C conventions.

 Nullability and validity
 ~~~~~~~~~~~~~~~~~~~~~~~~
 - Use standard validity bitmaps for nullability and adhere to Arrow’s canonical buffer organization (validity, offsets, data, etc.).

 Conversion boundaries
 ~~~~~~~~~~~~~~~~~~~~~~
 - Provide adapters to
   * Read from columnar file formats (Parquet/ORC) into Arrow-compatible ``ArrowArray`` structures.
   * Export/import data to other Arrow-compatible engines with zero or minimal copies.

 - Keep these adapters independent from heavy Arrow C++ SDK dependencies.
	.. Copyright 2024-present Alibaba Inc.

	.. Licensed under the Apache License, Version 2.0 (the "License");
	.. you may not use this file except in compliance with the License.
	.. You may obtain a copy of the License at

	.. http://www.apache.org/licenses/LICENSE-2.0

	.. Unless required by applicable law or agreed to in writing, software
	.. distributed under the License is distributed on an "AS IS" BASIS,
	.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	.. See the License for the specific language governing permissions and
	.. limitations under the License.

	.. _memory-format:

	Memory Format
	=============

	`Paimon Java <https://paimon.apache.org/docs/master/>`_ uses a row-level
	abstraction, ``InternalRow``, as the default data format interface.
	Row-level interfaces are generally more intuitive for programming and can be
	easily integrated into data processing pipelines such as filtering and merge sorting.
	However, this approach introduces conversion and access overhead,
	making it difficult to fully leverage the additional performance benefits provided
	by modern CPU SIMD vectorization.

	Considering that the C++ implementation focuses more on end-to-end performance,
	and that underlying data file formats (e.g., ORC, Parquet) are predominantly
	columnar, providing a columnar-centric data abstraction can minimize overhead
	during data flow, integrate more naturally with vectorized execution engines,
	and ultimately deliver superior overall performance.

	Why Apache Arrow
	----------------

	Apache Arrow is currently the most widely adopted in-memory columnar format and
	has strong native support for Parquet and ORC. It is well-integrated across
	open-source engines including Spark, Pandas, Drill, Impala, and Velox.

	Overall, using Apache Arrow as the in-memory format for Paimon C++ allows us to:

	- Maximize columnar performance and efficient data access patterns.
	- Seamlessly integrate with the Apache Arrow ecosystem and tooling.
	- Benefit from mature interoperability with popular data systems and formats.

	Versioning and Dependency Concerns
	----------------------------------

	One important consideration is that Arrow is an active open-source project with
	a broad and evolving C++ surface area that spans data formats, compute kernels,
	and file I/O. Due to frequent releases and a large API surface, different Arrow
	C++ SDK versions can introduce API incompatibilities.

	If Paimon C++ directly depends on the full Arrow C++ SDK, it may conflict with
	existing Arrow C++ dependencies in other compute engines, raising integration
	costs and increasing long-term maintenance complexity.

	Adopting the Arrow C Data Interface
	-----------------------------------

	To leverage Arrow’s performance and ecosystem benefits while avoiding tight
	coupling to specific Arrow C++ SDK versions, we use the Arrow C Data
	Interface as the default in-memory format for Paimon C++.

	Key advantages:

	- Version-neutral: The C Data Interface is designed to be stable and forward-compatible
	across Arrow versions.
	- Compiler-neutral: C language interfaces avoid ABI friction commonly seen with
	C++ compilers and standard libraries.
	- Broad interoperability: The C Data Interface is supported by Arrow-based
	systems and enables zero-copy or minimal-copy interchange of columnar data.

	Design Principles
	-----------------

	- Columnar-first abstraction:
	Paimon C++ will represent in-memory data using columnar buffers and schemas
	compatible with the Arrow C Data Interface, minimizing transformation overhead.

	- Minimal dependency footprint:
	Prefer stable C interfaces and lightweight utility layers; avoid linking
	against the full Arrow C++ SDK unless strictly necessary and well-isolated.

	- Vectorization-aware execution:
	Structure data layouts to align with SIMD processing (e.g., contiguous
	buffers, clear null bitmaps and type-specific arrays), enabling efficient
	filtering, projection, and aggregation.

	- Interoperability:
	Ensure that data produced and consumed by Paimon C++ can be handed off to
	Arrow-compatible engines and libraries without expensive conversions.

	- Compatibility with columnar storage:
	Maintain efficient paths from columnar file formats (Parquet, ORC) to in-memory
	columnar representations, minimizing decoding and marshaling overhead.

	Implementation Outline
	----------------------

	Schema and buffers
	~~~~~~~~~~~~~~~~~~~
	- Represent schemas and arrays using Arrow C Data Interface types (e.g., ``ArrowSchema``, ``ArrowArray``) with clear ownership and lifecycle.
	- Support nested types (structs, lists, maps) and common primitives (integers, floats, decimals, timestamps).

	Memory management
	~~~~~~~~~~~~~~~~~~~~
	- Define consistent ownership semantics for buffers and child arrays.
	- Employ reference counting or explicit release callbacks aligned with the Arrow C conventions.

	Nullability and validity
	~~~~~~~~~~~~~~~~~~~~~~~~
	- Use standard validity bitmaps for nullability and adhere to Arrow’s canonical buffer organization (validity, offsets, data, etc.).

	Conversion boundaries
	~~~~~~~~~~~~~~~~~~~~~~
	- Provide adapters to
	* Read from columnar file formats (Parquet/ORC) into Arrow-compatible ``ArrowArray`` structures.
	* Export/import data to other Arrow-compatible engines with zero or minimal copies.

	- Keep these adapters independent from heavy Arrow C++ SDK dependencies.