blob: d5cff6ad67f3f5c6676f3710de88a8a550ee48fb [file] [log] [blame]
.. Copyright 2024-present Alibaba Inc.
.. Licensed under the Apache License, Version 2.0 (the "License");
.. you may not use this file except in compliance with the License.
.. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS,
.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. See the License for the specific language governing permissions and
.. limitations under the License.
Write And Prepare Commit
========================
Batch writing requires the compute engine to pre-bucket data (bucket), using the
same bucketing strategy as Paimon to ensure correct ``Scan`` behavior, and to
specify the target ``partition``. Data should be accumulated into ``RecordBatch``
and written to Paimon.
Paimon C++ uses Apache Arrow as the :ref:`in-memory columnar format<memory-format>`
to more efficiently support writing to disk columnar formats such as ORC and
Parquet, thereby improving write throughput.
.. note::
Currently supported table types:
- Append table
- Primary Key table
Not supported in the current scope:
- Changelog
- Indexes
Bucketing Modes
---------------
- Append tables:
* Support ``bucket = -1`` (dynamic bucket mode)
* Support ``bucket > 0`` (fixed bucket mode)
- PK tables:
* Support ``bucket > 0`` (fixed bucket mode)
.. note::
PK tables do not support dynamic bucketing (``bucket = -1``).
RecordBatch Construction
------------------------
- The compute engine must:
- Apply the Paimon-consistent bucketing function to each row prior to batching.
- Assign the correct ``partition`` for each row.
- Group rows into Arrow ``RecordBatch`` per partition-bucket combination to minimize writer state changes and I/O overhead.
- Recommended practices:
- Use schema-aligned Arrow arrays with explicit validity bitmaps and offsets.
- Prefer batch sizes tuned for I/O throughput (e.g., tens to hundreds of MB per flush, depending on filesystem and cluster configuration).
- Maintain stable sort orders within a batch only if required by downstream merge or compaction logic; otherwise avoid unnecessary ordering costs.
Prepare Commit
----------------
The compute engine is responsible for triggering the writer nodes' ``PrepareCommit``.
Triggering conditions depend on the engine’s business needs and can follow either:
- Streaming mode: time-based or periodic triggers (e.g., every N seconds).
- Batch mode: trigger after all data in the batch has been written.
Once the compute engine collects ``CommitMessages`` from all writer nodes, it
can issue a ``Commit`` request to the control plane (management path) to create
a new ``Snapshot``.
Compatibility Goals
~~~~~~~~~~~~~~~~~~~
To ensure interoperability, the ``PrepareCommit`` result produced by Paimon C++
must be consumable by Paimon Java. Therefore:
- The structure and semantics of ``CommitMessage`` must remain consistent with
Java Paimon.
- Any evolution of the Java-side ``CommitMessage`` schema must be tracked and
validated on the C++ side to maintain cross-language compatibility.
Interface Design in Paimon C++
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Unlike Java Paimon, Paimon C++ does not expose ``BinaryRow``-like types in its
public interfaces. To preserve compatibility without leaking internal row
representations, Paimon C++ provides ``CommitMessage`` only through:
- Serialization: convert the internal commit state into a well-defined binary
representation that matches Java Paimon’s expectations.
- Deserialization: parse the Java-compatible binary representation back into
C++ commit structures for validation, replay, or tooling needs.
This design ensures that:
- Public APIs are independent of Java-specific row abstractions.
- Cross-language commit payloads remain stable and versionable.
- Internal data layouts can evolve without breaking external consumers.
CommitMessage Contract
~~~~~~~~~~~~~~~~~~~~~~
The ``CommitMessage`` must encode all information required by the coordinator to
produce a correct ``Snapshot``, which commonly includes (but is not limited to):
- Partition and bucket identifiers associated with written data.
- New data files, delete files, or changelog artifacts (as applicable to the table type).
- File-level metadata required for manifest and index updates (e.g., row counts, min/max statistics where applicable).
- Transactional markers and sequence numbers as required by table semantics.
- Any per-writer state necessary for deduplication or idempotent commits.
.. note::
Current C++ scope supports Append and PK tables. Changelog and index
artifacts are out of scope and should not be emitted in ``CommitMessage`` until
explicitly supported.
Serialization and Deserialization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Binary Format:
- The binary payload must strictly conform to Java Paimon’s ``CommitMessage`` encoding.
- Version tags or schema identifiers should be included to enable forwards/backwards compatibility and safe upgrades.
- Serialization API:
- Provide a function to serialize the writer’s commit state into a byte buffer (or stream) consumable by Java Paimon.
- Deserialization API:
- Provide a function to parse a Java-produced ``CommitMessage`` binary payload back into C++ commit structures for verification, replay, and testing.
- Validation:
- Include conformance tests to assert that C++ serialized payloads are accepted by Java Paimon.
- Include round-trip tests to ensure C++ can parse Java-produced payloads and vice versa for supported message versions.
Operational Flow
~~~~~~~~~~~~~~~~~~~~~~~
1. Writer nodes perform data ingestion and produce Arrow ``RecordBatch``
organized by partition and bucket.
2. Writers flush batches into ORC/Parquet files via registered ``file.format``
and ``file-system`` backends, producing file-level metadata and per-batch
commit state.
3. Each writer invokes ``PrepareCommit``, which:
- Aggregates per-writer state into a ``CommitMessage``.
- Serializes the message into a Java-compatible binary payload.
4. The compute engine gathers ``CommitMessages`` from all writers.
5. The compute engine issues a ``Commit`` request to the control plane with the
collected messages, resulting in a new ``Snapshot``.
6. The coordinator validates the messages, updates manifests/metadata, and
finalizes the snapshot atomically.