For historical reasons, append-optimized tables are also called
"append-only". They used to be truly append-only in previous versions
of Cloudberry, but these days they can in fact be updated and deleted
from. The segment files that store the tuples at the storage level are
append-only - data is never updated in place or physically overwritten -
but from a user's point of view, the table can be updated and deleted
from like a heap table. The AO visibility map and MVCC on the metadata
tables make updates and deletes work.

Append-optimized column storage (AOCS) tables follow a similar
layout, but there is some extra metadata to track which segment file
corresponds to which column. The source code lives in a separate
directory, but was originally copy-pasted from the append-only code, and
the high-level layout of the files is the same.


# Segment file format

The tuples in an append-optimized table are stored in a number of
"segment files", or "segfiles" for short. The segment files are stored
in the data directory, in files named "`<relfilenode>.<segno>`", just like
heap tables. Unlike heap tables, each segment in an append-only table
can have a different size, and some of them can even be missing. They can
also be larger than 1 GB, and are not extended in BLCKSZ-sized blocks,
like heap segments are. Segment zero is empty unless data has been
inserted in utility mode, in which case it's inserted into segment
zero. Extending the table by adding attributes via ALTER TABLE will also
push data to segment zero. Each table can have at most 128 segment files.

AOCS tables can similarly have at most 128 segment files for each column.
The segno range for a column depends on its filenum value in pg_attribute_encoding:
`segno 0,1-127 (filenum = 1), segno 128,129-255 (filenum = 2), ...`
To find the file range for an AOCS table column, find the attnum of that column
and then look up the corresponding filenum for that attnum in pg_attribute_encoding.

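The mapping above can be sketched as follows. This is an illustrative helper, not a function from the source; only the 128-files-per-column limit and the filenum ranges come from the text above.

```python
# Map a column's filenum (1-based, from pg_attribute_encoding) and a
# logical segno (0-127) to the physical file suffix used on disk for an
# AOCS table, per the ranges described above.
MAX_SEGFILES_PER_COLUMN = 128  # at most 128 segment files per column

def physical_segno(filenum: int, segno: int) -> int:
    """Illustrative: filenum = 1 covers files 0..127, filenum = 2
    covers 128..255, and so on."""
    assert filenum >= 1 and 0 <= segno < MAX_SEGFILES_PER_COLUMN
    return (filenum - 1) * MAX_SEGFILES_PER_COLUMN + segno
```

For example, the second column stored under filenum 2 keeps its segno-0 data in physical file 128.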
An append-only segfile consists of a number of variable-sized blocks
("varblocks"), one after another. The varblocks are aligned to 4 bytes.
Each block starts with a header. There are multiple kinds of blocks, with
different header information, but each block type begins with common "basic"
header information. The different header types are defined in
cdbappendonlystorage_int.h. The header also contains two checksums,
one for the header itself, and another for the whole block.
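
The 4-byte alignment rule can be expressed as a one-line sketch (a hypothetical helper for illustration, not code from the source):

```python
def align4(offset: int) -> int:
    """Round a file offset up to the next 4-byte boundary, matching the
    varblock alignment described above."""
    return (offset + 3) & ~3
```
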


# pg_appendonly table

| The "`pg_appendonly`" catalog table stores extra information for each AO |
| table. You can think of it as an extension of `pg_class`. |
| |
| |
| # Aosegments table |
| |
In addition to the segment files that store the user data, there are
three auxiliary heap tables for each AO table, which store metadata.
The aosegments table is one of them; it stores attributes such as
`modcount`, which reflects the number of DML operations that the AO
table has been subject to.

The aosegments table is initially named "`pg_aoseg.pg_aoseg_<oid>`",
where `<oid>` is the oid of the AO table. However, certain DDL statements,
such as ALTER TABLE commands that involve a rewrite of the AO table on disk,
replace this aosegments table with a new table whose `<oid>` suffix differs.
The complete process involves creating a temporary table and then updating
the `relfilenode` and the aosegments table oid associated with the AO table.

To find the current aosegments table of an AO table, query the
"`pg_catalog.pg_appendonly`" catalog table. The `segrelid` column in
that table yields the oid of the current aosegments table. Querying
pg_class with this oid gives the name of the current aosegments table.
Please note that the `segrelid` is not equal to the `<oid>` suffix of the
current aosegments table.

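The lookup chain can be illustrated with an in-memory sketch, where the catalogs are stand-in dictionaries rather than the real system tables, and all oids and names are made up:

```python
# Stand-ins for the relevant catalog columns (illustrative values only).
pg_appendonly = {16384: {"segrelid": 17001}}        # AO table oid -> segrelid
pg_class      = {17001: "pg_aoseg.pg_aoseg_16999"}  # oid -> relname

def aoseg_table_name(ao_oid: int) -> str:
    # Step 1: pg_appendonly.segrelid gives the aosegments table oid.
    segrelid = pg_appendonly[ao_oid]["segrelid"]
    # Step 2: pg_class maps that oid to the table name.
    return pg_class[segrelid]

# Note that the name suffix (16999 here) need not equal segrelid (17001).
```
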
Aosegment tables are similar to the TOAST tables for heaps. An
append-only table can have a TOAST table, too, in addition to the
AO-specific auxiliary tables.

The segment table lists all the segment files that exist. For each
segfile, it stores the file's current length (EOF). This is used to
make MVCC work, and to make AO tables crash-safe. When an AO table
is loaded, the loading process simply appends new blocks to the end of
one of the segment files. Once the loading statement has completed,
it updates the segment table with the new EOF of that segment file. If
the transaction aborts, or the server crashes before end-of-transaction,
the EOF stored in the aosegment table stays unchanged, and readers
will ignore any data that was appended by the aborted transaction. The
segments table is a regular heap table, and reading and updating it
follow the normal MVCC rules. Therefore, when a transaction has a
snapshot that was taken before the inserting transaction committed,
that transaction will still see the old EOF value, and therefore also
ignore the newly-inserted data.

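A minimal sketch of this EOF-based visibility, with the statement-level EOF update and the commit collapsed into one step for simplicity (all names are illustrative):

```python
# Sketch of EOF-based crash safety: the physical file is append-only,
# and readers trust only the EOF recorded in the aoseg metadata.

class Segfile:
    def __init__(self):
        self.data = b""         # physical file contents: append-only
        self.committed_eof = 0  # EOF as recorded in the aoseg table

def append(seg, payload, commit=True):
    seg.data += payload              # the physical append happens first
    if commit:                       # aoseg EOF is updated only on commit
        seg.committed_eof = len(seg.data)

def visible_bytes(seg):
    # A reader ignores anything past the recorded EOF, so data appended
    # by an aborted (or crashed) transaction is never seen.
    return seg.data[:seg.committed_eof]
```
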

# Visibility map table

The visibility map for each AO table is stored in a heap table named
"`pg_aoseg.pg_aovisimap_<oid>`". It is not to be confused with
the visibility map used for heaps in PostgreSQL 8.4 and above!

The AO visibility map is used to implement `DELETEs` and `UPDATEs`. An
`UPDATE` in PostgreSQL is like `DELETE+INSERT`. In heap tables, the ctid
field is used to implement update-chain-following when updates are done
in `READ COMMITTED` mode, but AO tables don't store that information, so
an update of a recently updated row in read committed mode behaves as if
the row was deleted.

The AO visibility map works as an overlay over the data. When a row
is `DELETEd` from an AO table, the original tuple is not modified. Instead,
the tuple is marked as dead in the visibility map.

The tuples in the visibility map follow the normal MVCC rules. When
a tuple is deleted, the visibility map tuple that covers that row
is updated, creating a new tuple in the visibility map table and
setting the old tuple's xmax. This implements MVCC visibility
for deleted AO tuples. Readers of the AO table read the visibility
map using the same snapshot as they would for a heap table. If a
deleted AO tuple is supposed to still be visible to the reader, the
reader will find and use the old visibility map tuple to determine
visibility, and it will therefore still see the old AO tuple as
visible.

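The overlay behavior can be sketched like this, with hypothetical snapshot labels standing in for real MVCC snapshots:

```python
# The visimap is an overlay: base tuples are immutable, and each snapshot
# sees the dead-set recorded by the visimap tuples visible to it.

base_tuples = {1: "alice", 2: "bob", 3: "carol"}  # rows in the segfile

visimap_as_of = {
    "before_delete": set(),  # old visimap tuple: nothing marked dead
    "after_delete": {2},     # new visimap tuple: row 2 marked dead
}

def scan(snapshot):
    # Return the rows visible to the given snapshot: the immutable base
    # tuples minus whatever that snapshot's visimap marks as dead.
    dead = visimap_as_of[snapshot]
    return {row: val for row, val in base_tuples.items() if row not in dead}
```

A transaction whose snapshot predates the delete still scans with the old visimap tuple and sees row 2.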

# Block directory table

The block directory is used to enable random access to an append-only
table. Normally, the blocks in an append-only table are read
sequentially, from beginning to end. To make index scans possible, a
map of blocks is stored in the block directory. The map can be used
to find the physical segment file and offset of the block that contains
the AO tuple with a given TID. So there is one extra level of indirection
with index scans on an AO table, compared to a heap table.

The block directory is only created if it's needed, by the first
`CREATE INDEX` command on an AO table.

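A sketch of the lookup, assuming a simplified block directory of (first row number, file offset) entries rather than the real on-disk format:

```python
# A block directory entry covers a range of row numbers and records
# where the covering varblock starts; an index scan searches these
# entries to locate the block holding a given row number.
import bisect

# (first_rownum, segfile_offset) pairs, sorted by first_rownum;
# the values here are illustrative.
entries = [(1, 0), (101, 4096), (201, 8192)]

def locate(rownum):
    # Find the last entry whose first_rownum <= rownum.
    firsts = [e[0] for e in entries]
    i = bisect.bisect_right(firsts, rownum) - 1
    return entries[i][1]  # file offset of the covering block
```
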

# TIDs and indexes

Indexes, and much of the rest of the system, expect every tuple to
have a unique physical identifier, the Tuple Identifier, or `TID`. In a
heap table, it consists of the heap block number and the item number in
the page. An AO table uses a very different physical layout,
however. A row can be located by the combination of the segfile number
it's in, and its row number within the segfile. Those two numbers are
mapped to the heap-style block+offset number, so that they look like
regular heap `TIDs`, and can be passed around the rest of the system,
and stored in indexes. See appendonlytid.h for details.

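An illustrative packing, assuming the segfile number occupies the high 7 bits of the block number and the row number is split across the remaining block bits and the offset; appendonlytid.h is the authoritative source for the actual bit layout:

```python
# Illustrative (segno, rownum) <-> (blockno, offset) packing. The split
# chosen here - 7 bits of segno, 25 + 15 bits of rownum - is an
# assumption for demonstration; see appendonlytid.h for the real layout.

def ao_tid(segno: int, rownum: int):
    blockno = (segno << 25) | (rownum >> 15)  # high 7 bits carry segno
    offset = rownum & 0x7FFF                  # low 15 bits of rownum
    return blockno, offset

def ao_tid_decode(blockno: int, offset: int):
    segno = blockno >> 25
    rownum = ((blockno & ((1 << 25) - 1)) << 15) | offset
    return segno, rownum
```

The round trip is lossless, which is what lets these synthetic TIDs be stored in ordinary indexes.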

# Vacuum

Append-only segment files are read-only after they're written, so
vacuum cannot modify them either. Vacuum reads tuples from an existing
segment file, leaves out those tuples that are dead, and writes the other
tuples back to a different segment file. Finally, the old segment file
is deleted. This is like `VACUUM FULL` on heap tables: new index entries
are created for every moved tuple.

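The compaction step can be sketched as follows. This is a simplified in-memory model; `tid_map` is a hypothetical stand-in for the TID remapping that drives the creation of new index entries:

```python
# Sketch of AO compaction: copy live tuples into a new segfile, then
# drop the old one. The old file is never modified in place.

def compact(old_segfile, is_dead):
    new_segfile = []
    tid_map = {}  # old position -> new position, for new index entries
    for pos, tup in enumerate(old_segfile):
        if not is_dead(pos):
            tid_map[pos] = len(new_segfile)
            new_segfile.append(tup)
    return new_segfile, tid_map
```
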

# Locking and snapshots

Scans lock the table using the lockmode determined by normal PostgreSQL
rules. A scan uses its snapshot to read the pg_aoseg table, and then
proceeds to read the segfiles that were visible to that snapshot.

When an insert or update needs a target segfile to insert rows into, it
scans the pg_aoseg table for a suitable segfile, and locks its pg_aoseg
tuple using heap_lock_tuple(). This scan is made with a fresh catalog
snapshot (formerly SnapshotNow). To keep this scan and locking free from
race conditions, only one backend at a time may perform them. For
that, the backend holds the relation extension lock while scanning
pg_aoseg for the target segment, but that lock is released as soon as
the pg_aoseg tuple is locked.

Vacuum truncation (to release space used by aborted transactions) also
scans pg_aoseg with a fresh catalog snapshot, and holds the relation
extension lock. It skips over tuples that are locked, so as not to
interfere with inserters.

Vacuum compaction locks both the segfile being compacted and the
insertion target segfile where the surviving rows are copied, in the
same way that an insertion does.

The vacuum drop phase, which recycles segments that have been compacted,
checks the xmin of each AWAITING_DROP segment. If it's visible to
everyone, the segfile is recycled. It uses the relation extension lock
to protect the scan over pg_aoseg.


# Unique indexes

Answering uniqueness checks for AO/AOCO tables involves a complication. Unlike
heap, AO/CO tables don't store the xmin/xmax fields in the tuples. So, we have to
rely on the block directory rows that "cover" the data rows to satisfy index lookups.
Since the block directory is maintained as a heap table, visibility checks on it
are identical to those on any other heap table: the xmin/xmax of the block directory
row(s) are leveraged. This means we don't have to write any special
visibility checking code ourselves, nor do we need to worry about transactions
vs subtransactions.

Since block directory rows are usually written much later than the data rows
they cover, there are windows in which there is no block directory row on disk
for a given data row - a problem for concurrent unique index checks. So during
INSERT/COPY, at the beginning of the insertion operation, we insert a
placeholder block directory row that covers ALL future tuples going to the
current segment file for this command.

To answer unique index lookups, we don't have to physically fetch the tuple from
the table. This is key to answering unique index lookups against placeholder
rows which predate their corresponding data rows. We simply perform a sysscan of
the block directory, and if we have a visible entry that encompasses the rowNum
being looked up, we go on to the next check. Otherwise, we have no conflict and
return. The next check is against the visimap, to see if
the tuple is visible. If yes, then we have a conflict. Since the snapshot used
to perform uniqueness checks for AO/CO is SNAPSHOT_DIRTY (we currently don't
support SNAPSHOT_SELF, which is used for CREATE UNIQUE INDEX CONCURRENTLY), it
is possible to detect if the block directory tuple (and by extension the data
tuple) was inserted by a concurrent in-progress transaction. In this case, we
simply skip the visimap check and return true. The benefit of performing the
sysscan on the block directory is that HeapTupleSatisfiesDirty() is called,
and in the process, the snapshot's xmin and/or xmax fields are updated (see
SNAPSHOT_DIRTY for details on its special contract). Returning true in this
situation causes the unique index code's xwait mechanism to kick in (see
_bt_check_unique()), and the current transaction will wait for the one that
inserted the tuple to commit or abort.

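The decision flow above can be condensed into a sketch; the boolean inputs are illustrative stand-ins for what the block directory sysscan and the visimap check would report:

```python
# Decision flow for an AO/CO unique-index check, as described above.

def unique_check(blockdir_entry_found: bool,
                 inserted_by_inprogress_xact: bool,
                 dead_in_visimap: bool) -> bool:
    """Return True if a (possible) conflict exists."""
    if not blockdir_entry_found:
        return False              # no covering entry: no conflict
    if inserted_by_inprogress_xact:
        return True               # skip visimap; caller enters xwait
                                  # (see _bt_check_unique())
    return not dead_in_visimap    # a visible tuple is a real conflict
```
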
Tableam changes: Since there is a lot of overhead (leading to ~20x performance
degradation in the worst case) in setting up and tearing down scan descriptors
for AO/CO tables, we avoid the scanbegin..fetch..scanend construct in
table_index_fetch_tuple_check().

Instead, a new tableam API, index_unique_check(), is used, which is implemented
only for AO/CO tables. Here, we fetch a UniqueCheckDesc, which stores all the
in-memory state needed to perform a unique index check. This descriptor is
attached to the DMLState structs. Currently, the descriptor holds only a block
directory struct. It will be modified later on to hold a visimap reference to
help implement DELETEs/UPDATEs. Furthermore, we initialize this struct on the
first unique index check performed, akin to how we initialize descriptors for
insert and delete.

AO lazy VACUUM differs from heap vacuum in that the ctids of data
tuples change (and the index tuples need to be updated as a consequence). It
leverages the scan and insert code to scan live tuples from each segfile and to
move (insert) them into a target segfile. While moving tuples, we need to avoid
performing uniqueness checks from the insert machinery, to ensure that
we avoid spurious conflicts between the moved tuple and the original tuple. We
don't need to insert a placeholder row for the backend running vacuum, as the
old index entries will still point to the segment being compacted. This will be
the case up until the index entries are bulk deleted, but by then the new index
entries, along with new block directory rows, would already have been written
and would be able to answer uniqueness checks.

Transaction isolation: Since uniqueness checks utilize the special dirty
snapshot, these checks can cross transaction isolation boundaries. For instance,
consider what happens if we are in a repeatable read transaction and
we insert a key that was inserted by a concurrent transaction. Further, let's
say that the repeatable read transaction's snapshot was taken before the
concurrent transaction started. This means that the repeatable read transaction
won't be able to see the conflicting key (e.g. with a SELECT). In spite of
that, conflicts will still be detected. Depending on whether the concurrent
transaction committed or is still in progress, the repeatable read transaction
will raise a conflict or enter xwait, respectively. This behavior is table
AM agnostic.

Partial unique indexes: We don't have to do anything special for partial indexes.
Keys not satisfying the partial index predicate are never inserted into the
index, and hence no uniqueness checks are triggered (see
ExecInsertIndexTuples()). Also, during partial unique index builds, keys that
don't satisfy the partial index predicate are never inserted into the index
(see *_index_build_range_scan()).