blob: 506db2a29b3ab4662fc5851b5c4575a190ce9a1d [file] [log] [blame]
.. Copyright 2024-present Alibaba Inc.
.. Licensed under the Apache License, Version 2.0 (the "License");
.. you may not use this file except in compliance with the License.
.. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS,
.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. See the License for the specific language governing permissions and
.. limitations under the License.
Read and Data Evolution
===============================================
Paimon by functionality can be divided into two layers:
- Control Plane: Responsible for accessing and managing Meta (snapshot, manifest, etc.), including:
- Catalog / Database access
- Table retrieval
- Collection and resolution of data files
- Data Plane: Responsible for accessing actual data files, including:
- Readers for various file formats
- Coordinated reading of file collections
The control plane and data plane interact primarily via DataSplit (the query plan). Java currently supports a standard
DataSplit protocol which includes the necessary meta information to access data files. With DataSplit, a high-performance
data access path can be integrated.
At compute time, the execution engine (reader) does not need to be aware of the concrete table type or its metadata details.
It only needs to follow the instructions within the DataSplit (query plan) to perform data reading operations.
With the layered abstraction of the control plane and data plane, and the use of DataSplit as a stable protocol interface,
the two layers can evolve their functionality and optimize code relatively independently. This design also enables
cross-language task scheduling and interaction (e.g., Java and C++), substantially reducing engineering maintenance costs
across the two language ecosystems.
Schema Evolution
================
Scope and Compatibility
-----------------------
C++ Paimon supports all evolution kinds available in Java Paimon for non-nested types:
- Add column
- Drop column
- Reorder columns
- Rename column
- Change column type
.. note::
- Only non-nested type evolution is supported. Nested columns (struct, array, map) are not supported.
- Partition keys: Only column reordering is supported; other operations are not supported (consistent with Java Paimon).
- Primary key:
- Adding or dropping columns is not supported.
- Other operations are supported (consistent with Java Paimon).
Per-File Schema via Field IDs
-----------------------------
In DataSplit, each file may have a completely different data schema. Paimon uses field IDs to uniquely identify fields.
Overflow Behavior Disclaimer
----------------------------
Overflow behavior is undefined for C++ and Java Paimon. Results in overflow scenarios may:
- Be incorrect values,
- Return an error status,
- Or be null.
C++ Paimon does not guarantee identical results to Java Paimon in overflow scenarios. Users should not rely on identical
return values between implementations.
Type Change Support Matrix
--------------------------
The table below indicates support for changing a column type from ``source`` to ``target``. Refer to the numbered notes below the table
for caveats.
.. list-table::
:header-rows: 1
:widths: 12 10 10 10 10 10 10 8 12 10 8 18 10
* - src \\ target
- tinyint
- smallint
- int
- bigint
- float
- double
- bool
- string
- binary
- date
- timestamp (without tz)
- decimal
* - tinyint
-
-
-
-
-
-
-
-
-
-
-
-
* - smallint
- 1️⃣
-
-
-
-
-
-
-
-
-
-
-
* - int
- 1️⃣
- 1️⃣
-
-
-
-
-
-
-
-
- 1️⃣
-
* - bigint
- 1️⃣
- 1️⃣
- 1️⃣
-
-
-
-
-
-
-
- 6️⃣
-
* - float
- 2️⃣
- 2️⃣
- 2️⃣
- 2️⃣
-
-
-
- 3️⃣ 4️⃣
-
-
-
-
* - double
- 2️⃣
- 2️⃣
- 2️⃣
- 2️⃣
- 2️⃣
-
-
- 3️⃣ 4️⃣
-
-
-
-
* - bool
-
-
-
-
-
-
-
-
-
-
-
-
* - string
-
-
-
-
- 3️⃣
- 3️⃣
-
-
-
-
- 5️⃣
- 7️⃣
* - binary
-
-
-
-
-
-
-
-
-
-
-
-
* - date
-
-
-
-
-
-
-
-
-
-
- 5️⃣
-
* - timestamp (without tz)
-
-
- 1️⃣
-
-
-
-
-
-
-
-
-
* - decimal
- 1️⃣
- 1️⃣
- 1️⃣
- 1️⃣
-
-
-
-
-
-
-
-
.. admonition:: Overflow Behavior Notes
:class: note
1️⃣ Integer downcast overflow behavior matches Java in specific cases.
Example: smallint -> tinyint, 32767 becomes -1; int -> smallint, -2147483648 becomes 0.
2️⃣ Floating-point overflow behavior is partially consistent with Java and partially different.
Example: float -> tinyint
- Java: MAX_FLOAT -> -1, INFINITY -> -1
- C++: MAX_FLOAT -> 0, INFINITY -> 0
3️⃣ Keyword differences for special float/double values:
- Java: Infinity, -Infinity, NaN
- C++: inf, -inf, nan
4️⃣ Printing difference:
- C++ Paimon prints 1.0 as ``1``
- Java Paimon prints 1.0 as ``1.0``
5️⃣ Timestamp precision and range differences:
- Java Paimon: 0000-01-01 00:00:00.000000000 to 9999-12-31 23:59:59.999999999
- C++ Paimon: 1677-09-21 00:12:43.145224192 to 2262-04-11 23:47:16.854775807
- C++ only supports nanosecond precision; range is smaller.
6️⃣ bigint -> timestamp range differences:
- Java Paimon (ms): ``[MIN_INT64/1000, MAX_INT64/1000]`` seconds
- C++ Paimon (ns): ``[MIN_INT64/1e9, MAX_INT64/1e9]`` seconds
7️⃣ string -> decimal with precision > 38:
- C++ returns ``null`` if parsing would overflow 128-bit arithmetic.
- Java may rescale and return a value based on the rescaled precision.
- Example input: ``1111111111111111111111111111111111111.15``, Java returns: ``1111111111111111111111111111111111111.2``, C++ returns: ``null``
Implementation Guidance
-----------------------
- Use DataSplit as the sole interface between control and data planes. Treat it as the canonical query plan contract.
- Resolve field types and IDs per file; prefer inline data file metadata, fallback to table schema files when necessary.
- Expect per-file schema variability; design readers to align by field IDs rather than positional indices.
- Do not assume identical overflow semantics across C++ and Java; tests should validate acceptable ranges and nullability.
- For timestamp handling, consider precision/range constraints in C++ when interoperating with Java-produced data splits.