| .. Licensed to the Apache Software Foundation (ASF) under one |
| .. or more contributor license agreements. See the NOTICE file |
| .. distributed with this work for additional information |
| .. regarding copyright ownership. The ASF licenses this file |
| .. to you under the Apache License, Version 2.0 (the |
| .. "License"); you may not use this file except in compliance |
| .. with the License. You may obtain a copy of the License at |
| |
| .. http://www.apache.org/licenses/LICENSE-2.0 |
| |
| .. Unless required by applicable law or agreed to in writing, |
| .. software distributed under the License is distributed on an |
| .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| .. KIND, either express or implied. See the License for the |
| .. specific language governing permissions and limitations |
| .. under the License. |
| |
| .. Borrowed the file from Apache Paimon: |
| .. https://github.com/apache/paimon/blob/master/docs/content/concepts/basic-concepts.md |
| |
| Basic Concepts |
| ======================== |
| |
| File Layouts |
| ------------------------ |
| All files of a table are stored under one base directory. Paimon files are |
| organized in a layered style. The following image illustrates the file layout. |
| Starting from a snapshot file, Paimon readers can recursively access all records |
| from the table. |
| |
| .. image:: _static/file-layout.png |
| :alt: File Layout |
| :align: center |
| :width: 100% |
| |
| Snapshot |
| ------------------- |
| All snapshot files are stored in the snapshot directory. |
| |
| A snapshot file is a JSON file containing information about this snapshot, |
| including the schema file in use and the manifest list containing all changes of |
| this snapshot. A snapshot captures the state of a table at some point in time. |
| Users can access the latest data of a table through the latest snapshot. |
| By time traveling, users can also access the previous state of a table through |
| an earlier snapshot. |
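
As a rough illustration of how the latest snapshot and time travel relate, the sketch below reads snapshot JSON files from a directory and picks one by commit time. It is not Paimon's reader implementation: the ``snapshot-<id>`` file naming and the ``id`` and ``timeMillis`` fields are assumptions made for this example, and the helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


def load_snapshots(snapshot_dir: Path) -> List[Dict]:
    """Parse every snapshot JSON file in the directory, ordered by snapshot id.

    Assumes files are named `snapshot-<id>` and carry an integer `id` field.
    """
    snapshots = [json.loads(p.read_text()) for p in snapshot_dir.glob("snapshot-*")]
    return sorted(snapshots, key=lambda s: s["id"])


def latest_snapshot(snapshots: List[Dict]) -> Optional[Dict]:
    """The latest snapshot exposes the latest data of the table."""
    return snapshots[-1] if snapshots else None


def snapshot_as_of(snapshots: List[Dict], timestamp_millis: int) -> Optional[Dict]:
    """Time travel: the newest snapshot committed at or before the timestamp.

    Assumes each snapshot carries its commit time in a `timeMillis` field.
    """
    candidates = [s for s in snapshots if s["timeMillis"] <= timestamp_millis]
    return candidates[-1] if candidates else None
```

A reader would then follow the chosen snapshot's schema and manifest list to reach the data; that part is omitted here.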
| |
| Manifest Files |
| ------------------- |
| All manifest lists and manifest files are stored in the manifest directory. A |
| manifest list is a list of manifest file names. |
| A manifest file records changes to LSM data files and changelog files, for |
| example which LSM data files were created and which were deleted in the |
| corresponding snapshot. |
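
One way to picture how manifest files describe a table's contents is to replay their entries in order: a file that is created becomes live, and a file that is deleted is dropped. The sketch below is a simplified model under that assumption. ``ManifestEntry``, the ``"ADD"``/``"DELETE"`` kinds, and ``live_data_files`` are hypothetical names for this example, not the on-disk manifest format.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class ManifestEntry:
    kind: str        # "ADD" (file created) or "DELETE" (file deleted); labels are assumptions
    file_name: str   # name of the data file the entry refers to


def live_data_files(manifest_list: List[List[ManifestEntry]]) -> Set[str]:
    """Replay the manifests of a manifest list in order and return the live data files."""
    live: Set[str] = set()
    for manifest in manifest_list:
        for entry in manifest:
            if entry.kind == "ADD":
                live.add(entry.file_name)
            elif entry.kind == "DELETE":
                live.discard(entry.file_name)
            else:
                raise ValueError("unknown entry kind: %r" % entry.kind)
    return live
```

For example, a first manifest that adds two files followed by a second that deletes one of them and adds a third leaves two live files.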
| |
| Data Files |
| --------------------------- |
| Data files are grouped by partition. Currently, Paimon supports Parquet |
| (default), ORC, and Lance as data file formats. |
| |
| .. note:: |
| Writing data files in Avro format is not supported yet. |
| |
| Partition |
| --------------------------- |
| Paimon adopts the same partitioning concept as Apache Hive to separate data. |
| |
| Partitioning is an optional way of dividing a table into related parts based on |
| the values of particular columns like date, city, and department. Each table can |
| have one or more partition keys to identify a particular partition. |
| |
| By partitioning, users can efficiently operate on a slice of records in the |
| table. |
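
To make the benefit of partitioning concrete, the sketch below models partitions as key/value maps, renders them in the Hive-style ``key=value`` path form, and selects only the partitions that match a filter. The helper names and the sample keys (``dt``, ``city``) are illustrative assumptions, not part of any Paimon API.

```python
from typing import Dict, List


def partition_path(partition: Dict[str, str]) -> str:
    """Render a partition as a Hive-style relative path, e.g. dt=2024-01-01/city=Berlin."""
    return "/".join("%s=%s" % (key, value) for key, value in partition.items())


def prune_partitions(partitions: List[Dict[str, str]],
                     wanted: Dict[str, str]) -> List[Dict[str, str]]:
    """Keep only partitions whose values match every key in the filter.

    Partitions that do not match can be skipped without reading their data files.
    """
    return [p for p in partitions
            if all(p.get(key) == value for key, value in wanted.items())]
```

Filtering on ``{"dt": "2024-01-01"}`` against three partitions spread over two dates would leave only the partitions for that date, so a query touches just that slice of the table.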
| |
| Consistency Guarantees |
| --------------------------- |
| Paimon writers use a two-phase commit protocol to atomically commit a batch of |
| records to the table. Each commit produces at most two snapshots, depending on |
| the incremental write and compaction strategy. If only incremental |
| writes are performed without triggering a compaction operation, only an |
| incremental snapshot will be created. If a compaction operation is triggered, an |
| incremental snapshot and a compacted snapshot will be created. |
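
The rule for how many snapshots a commit produces can be written down as a small model. This is only a sketch of the rule stated above, not the commit protocol itself: the ``commit`` helper is hypothetical, and the ``"APPEND"`` and ``"COMPACT"`` labels for the incremental and compacted snapshots are assumptions for this example.

```python
from typing import Dict, List


def commit(snapshots: List[Dict], compaction_triggered: bool) -> List[Dict]:
    """Return the snapshots created by one commit and append them to the table history.

    Every commit creates an incremental snapshot; a compacted snapshot follows
    only when a compaction operation was triggered.
    """
    next_id = snapshots[-1]["id"] + 1 if snapshots else 1
    created = [{"id": next_id, "kind": "APPEND"}]  # incremental snapshot
    if compaction_triggered:
        created.append({"id": next_id + 1, "kind": "COMPACT"})  # compacted snapshot
    snapshots.extend(created)
    return created
```

Under this model a commit without compaction advances the history by one snapshot, and a commit with compaction advances it by two.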
| |
| For any two writers modifying a table at the same time, as long as they do not |
| modify the same partition, their commits can occur in parallel. If they modify |
| the same partition, only snapshot isolation is guaranteed. That is, the final |
| table state may be a mix of the two commits, but no changes are lost. |
| |
| .. note:: |
| Paimon C++ currently does not support compaction. |