For historical reasons, append-optimized tables are also called “append-only”. They were truly append-only in previous versions of Cloudberry, but these days they can in fact be updated and deleted from. The segment files that store the tuples at the storage level are append-only, in that data is never updated in place or physically overwritten, but from a user's point of view the table can be updated and deleted from like a heap table. The AO visibility map and MVCC on the metadata tables make updates and deletes work.
Append-Optimized Column Storage (AOCS) tables follow a similar layout, but there is some extra metadata to track which segment file corresponds to which column. The source code lives in a separate directory, but it was originally copy-pasted from the append-only code, and the high-level layout of the files is the same.
The tuples in an append-optimized table are stored in a number of “segment files”, or “segfiles” for short. The segment files are stored in the data directory, in files named “<relfilenode>.<segno>”, just like heap tables. Unlike heap tables, each segment of an append-only table can have a different size, and some of them can even be missing. They can also be larger than 1 GB, and are not extended in BLCKSZ-sized blocks like heap segments are. Segment zero is empty unless data has been inserted in utility mode, in which case it goes into segment zero. Extending the table by adding attributes via ALTER TABLE will also push data to segment zero. Each table can have at most 128 segment files.
AOCS tables can similarly have at most 128 segment files for each column. The range of segnos for a column depends on the filenum value in pg_attribute_encoding: segno 0-127 (filenum = 1), segno 128-255 (filenum = 2), and so on. To find the file range for an AOCS table column, find the attnum of that column and then look up the corresponding filenum for that attnum in pg_attribute_encoding.
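The filenum-to-segno mapping above is simple arithmetic; here is a minimal sketch of it. The constant name mirrors the 128-segfile-per-column limit described above (in the real code the limit is the `MAX_AOREL_CONCURRENCY` constant); the function itself is hypothetical, for illustration only.

```python
# Sketch: compute the segno range owned by a given filenum for an AOCS
# column. Each filenum covers a contiguous block of 128 segment file
# numbers: filenum 1 -> segnos 0-127, filenum 2 -> segnos 128-255, etc.

MAX_AOREL_CONCURRENCY = 128  # per-column segfile limit described above

def segno_range_for_filenum(filenum: int) -> range:
    """Return the range of segnos belonging to the given filenum (1-based)."""
    if filenum < 1:
        raise ValueError("filenum is 1-based")
    start = (filenum - 1) * MAX_AOREL_CONCURRENCY
    return range(start, start + MAX_AOREL_CONCURRENCY)
```

So, for a column whose pg_attribute_encoding row says filenum = 2, the on-disk files for that column use segnos 128 through 255.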
An append-only segfile consists of a number of variable-sized blocks (“varblocks”), one after another. The varblocks are aligned to 4 bytes. Each block starts with a header. There are multiple kinds of blocks, with different header information. Each block type begins with common “basic” header information, however. The different header types are defined in cdbappendonlystorage_int.h. The header also contains two checksums, one for the header, and another for the whole block.
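The 4-byte alignment mentioned above means each varblock begins at an offset rounded up to the next multiple of 4. A one-line sketch of that rounding:

```python
def align_up_4(offset: int) -> int:
    """Round a file offset up to the next 4-byte boundary, as varblocks require."""
    return (offset + 3) & ~3
```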
The “pg_appendonly” catalog table stores extra information for each AO table. You can think of it as an extension of pg_class.
In addition to the segment files that store the user data, there are three auxiliary heap tables for each AO table, which store metadata. The aosegments table is one of them; it stores helpful attributes such as modcount, which reflects the number of DML operations that the AO table has been subject to.
The aosegments table is initially named “pg_aoseg.pg_aoseg_<oid>”, where <oid> is the oid of the AO table. However, if certain DDL statements are applied, such as an ALTER TABLE that rewrites the AO table on disk, this aosegments table is replaced by a new table with a different <oid> suffix. The complete process involves creating a temporary table, followed by updating the relfilenode and the aosegments table oid associated with the AO table.
In order to find the current aosegments table of an AO table, the “pg_catalog.pg_appendonly” catalog table must be queried. The segrelid column in this table yields the oid of the current aosegments table. If we now query pg_class with this oid, we will get the name of the current aosegments table. Please note that the segrelid is not equal to the <oid> suffix of the current aosegments table.
Aosegment tables are similar to TOAST tables for heaps. An append-only table can have a TOAST table too, in addition to the AO-specific auxiliary tables.
The segment table lists all the segment files that exist. For each segfile, it stores the segfile's current length (EOF). This is used to make MVCC work, and to make AO tables crash-safe. When an AO table is loaded, the loading process simply appends new blocks to the end of one of the segment files. Once the loading statement has completed, it updates the segment table with the new EOF of that segment file. If the transaction aborts, or the server crashes before end-of-transaction, the EOF stored in the aosegment table stays unchanged, and readers will ignore any data that was appended by the aborted transaction. The segments table is a regular heap table, and reading and updating it follow the normal MVCC rules. Therefore, when a transaction has a snapshot that was taken before the inserting transaction committed, that transaction will still see the old EOF value, and therefore also ignore the newly-inserted data.
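The crash-safety argument above can be illustrated with a toy model (hypothetical names, not the actual implementation): appends hit the physical file immediately, but readers trust only the EOF recorded in the aoseg entry visible to their snapshot, so data past that EOF is invisible garbage.

```python
# Toy model of AO crash safety. The segfile grows as soon as data is
# appended, but the committed EOF (the aoseg entry) only advances at end
# of transaction, so aborted appends are simply never read.

class ToySegfile:
    def __init__(self):
        self.data = b""         # physical segment file contents
        self.committed_eof = 0  # EOF as recorded in the aoseg table

    def append(self, payload: bytes) -> None:
        self.data += payload    # physical append happens right away

    def commit(self) -> None:
        self.committed_eof = len(self.data)  # aoseg EOF updated at commit

    def read_visible(self) -> bytes:
        return self.data[:self.committed_eof]  # readers stop at the EOF

seg = ToySegfile()
seg.append(b"row1;")
seg.commit()                  # transaction 1 commits
seg.append(b"row2;")          # transaction 2 appends, then aborts/crashes
assert seg.read_visible() == b"row1;"  # the aborted data is ignored
```

The snapshot behavior follows the same principle: an older snapshot simply sees an older EOF value in the heap-based aoseg table.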
The visibility map for each AO table is stored in a heap table named “pg_aoseg.pg_aovisimap_<oid>”. It is not to be confused with the visibility map used for heaps in PostgreSQL 8.4 and above!
The AO visibility map is used to implement DELETEs and UPDATEs. An UPDATE in PostgreSQL is like DELETE+INSERT. In heap tables, the ctid field is used to implement update-chain-following when updates are done in READ COMMITTED mode, but AO tables don't store that information, so an update of a recently updated row in read committed mode behaves as if the row was deleted.
The AO visibility map works as an overlay, over the data. When a row is DELETEd from an AO table, the original tuple is not modified. Instead, the tuple is marked as dead in the visibility map.
The tuples in the visibility map follow the normal MVCC rules. When a tuple is deleted, the visibility map tuple that covers that row is updated, creating a new tuple in the visibility map table and setting the old tuple's xmax. This implements MVCC visibility for deleted AO tuples. Readers of the AO table read the visibility map using the same snapshot as they would for a heap table. If a deleted AO tuple is supposed to still be visible to the reader, the reader will find and use the old visibility map tuple to determine visibility, and it will therefore still see the old AO tuple as visible.
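The overlay idea can be sketched as follows. This toy model deliberately ignores the MVCC of the visimap itself (in reality the map is a heap table whose rows cover ranges of row numbers and carry their own xmin/xmax): it just shows that deletion never touches the stored tuples, only the overlay.

```python
# Toy overlay model: tuples are never modified in place; a DELETE only
# records the row number in the visibility map. A scan consults the map
# to decide which stored rows are still visible.

def visible_rows(stored_rows, deleted_rownums):
    """Return the rows a scan would see, given the deleted-row overlay."""
    return [row for rownum, row in enumerate(stored_rows)
            if rownum not in deleted_rownums]

rows = ["alice", "bob", "carol"]   # immutable on-disk tuples
deleted = {1}                      # "bob" was DELETEd via the visimap
assert visible_rows(rows, deleted) == ["alice", "carol"]
```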
The block directory is used to enable random access to an append-only table. Normally, the blocks in an append-only table are read sequentially, from beginning to end. To make index scans possible, a map of blocks is stored in the block directory. The map can be used to find the physical segment file and offset of the block that contains the AO tuple with a given TID. So there is one extra level of indirection with index scans on an AO table, compared to a heap table.
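The extra level of indirection might look like this (a sketch with made-up entry shapes, not the real block directory format): each entry records the first row number of a block together with its physical location, and a lookup finds the covering entry by binary search.

```python
# Toy block directory: entries map the first row number of a block to its
# (segno, file offset). An index lookup for a rowNum finds the covering
# entry, then the scan seeks to that offset to fetch the tuple.
import bisect

# (firstRowNum, segno, fileOffset) entries, sorted by firstRowNum
blockdir = [(1, 1, 0), (101, 1, 8192), (201, 1, 16384)]

def locate(rownum: int):
    """Return (segno, offset) of the block covering rownum, or None."""
    keys = [e[0] for e in blockdir]
    i = bisect.bisect_right(keys, rownum) - 1
    if i < 0:
        return None                # no block starts at or before rownum
    _, segno, offset = blockdir[i]
    return segno, offset

assert locate(150) == (1, 8192)    # row 150 is in the block starting at row 101
```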
The block directory is only created if it's needed, by the first CREATE INDEX command on an AO table.
Indexes, and much of the rest of the system, expect every tuple to have a unique physical identifier, the Tuple Identifier, or TID. In a heap table, it consists of the heap block number and the item number within the page. An AO table uses a very different physical layout, however. A row can be located by the combination of the segfile number it's in and its row number within the segfile. Those two numbers are mapped to the heap-style block+offset number, so that they look like regular heap TIDs, can be passed around the rest of the system, and can be stored in indexes. See appendonlytid.h for details.
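A simplified encoding sketch follows. It assumes 7 bits of segfile number (enough for the 128-segfile limit) and 40 bits of row number packed into the 48 bits of an ItemPointer; the real bit packing in appendonlytid.h differs in detail (for instance, a valid heap TID's offset part can never be zero), so treat this purely as an illustration of the idea.

```python
# Sketch of mapping (segno, rowNum) into a heap-style 48-bit TID.
# Assumed layout (illustrative only): top 7 bits = segfile number,
# low 40 bits = row number within that segfile.

ROWNUM_BITS = 40

def encode_aotid(segno: int, rownum: int) -> int:
    """Pack segno and rownum into one 48-bit integer 'TID'."""
    assert 0 <= segno < 128 and 0 <= rownum < (1 << ROWNUM_BITS)
    return (segno << ROWNUM_BITS) | rownum

def decode_aotid(tid: int):
    """Recover (segno, rownum) from the packed 'TID'."""
    return tid >> ROWNUM_BITS, tid & ((1 << ROWNUM_BITS) - 1)

assert decode_aotid(encode_aotid(3, 123456)) == (3, 123456)
```

Because the packed value fits in the same space as a heap TID, the rest of the system (and every index AM) can store and compare it without knowing it came from an AO table.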
Append-only segment files are read-only after they're written, so Vacuum cannot modify them either. Vacuum reads tuples from an existing segment file, skips the tuples that are dead, and writes the surviving tuples to a different segment file. Finally, the old segment file is deleted. This is like VACUUM FULL on heap tables: new index entries are created for every moved tuple.
Scans lock the table using the lockmode determined by normal PostgreSQL rules. A scan uses its snapshot to read the pg_aoseg table, and then proceeds to read the segfiles that were visible to the snapshot.
When an insert or update needs a target segfile to insert rows into, it scans the pg_aoseg table for a suitable segfile and locks its pg_aoseg tuple using heap_lock_tuple(). This scan is made with a fresh catalog snapshot (formerly SnapshotNow). To keep this scan-and-lock free from race conditions, only one backend at a time can perform it on a segment. For that, the backend holds the relation extension lock while scanning pg_aoseg for the target segment, but that lock is released as soon as the pg_aoseg tuple is locked.
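The selection protocol can be sketched as below (hypothetical structure, not the real code): the scan over aoseg entries happens under the relation extension lock, already-locked entries are skipped, and the extension lock is dropped once a tuple lock is taken.

```python
# Toy sketch of target-segfile selection. The extension_lock stands in for
# the relation extension lock; entry["locked"] stands in for a tuple lock
# taken via heap_lock_tuple().
import threading

extension_lock = threading.Lock()
aoseg = [{"segno": 1, "locked": False},
         {"segno": 2, "locked": True}]   # segno 2 is held by another writer

def choose_target_segfile():
    with extension_lock:                 # serialize the scan itself
        for entry in aoseg:
            if not entry["locked"]:      # skip segfiles other writers hold
                entry["locked"] = True   # "heap_lock_tuple" the aoseg row
                return entry["segno"]
    return None                          # no free segfile available

assert choose_target_segfile() == 1      # segno 2 is skipped, segno 1 taken
```

Note how the extension lock protects only the scan; the tuple lock is what outlives the function, which is exactly why vacuum truncation (next paragraph) can recognize and skip in-use segfiles.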
Vacuum truncation (to release space used by aborted transactions) also scans pg_aoseg with a fresh catalog snapshot, and holds the relation extension lock. It skips over tuples that are locked, to not interfere with inserters.
Vacuum compaction locks both the segfile being compacted, and the insertion target segment where the surviving rows are copied, the same way that an insertion does.
The vacuum drop phase, which recycles segments that have been compacted, checks the xmin of each AWAITING_DROP segment. If it's visible to everyone, the segfile is recycled. It uses the relation extension lock to protect the scan over pg_aoseg.
To answer uniqueness checks for AO/AOCO tables, we have a complication. Unlike heap, in AO/CO we don't store the xmin/xmax fields in the tuples. So, we have to rely on block directory rows that “cover” the data rows to satisfy index lookups. Since the block directory is maintained as a heap table, visibility checks on it are identical to any other heap table: the xmin/xmax of the block directory row(s) will be leveraged. This means we don't have to write any special visibility checking code ourselves, nor do we need to worry about transactions vs subtransactions.
Since block directory rows are usually written long after the data row has been inserted, there are windows in which there is no block directory row on disk for a given data row - a problem for concurrent unique index checks. So during INSERT/COPY, at the beginning of the insertion operation, we insert a placeholder block directory row to cover ALL future tuples going to the current segment file for this command.
To answer unique index lookups, we don't have to physically fetch the tuple from the table. This is key to answering unique index lookups against placeholder rows which predate their corresponding data rows. We simply perform a sysscan of the block directory, and if we have a visible entry that encompasses the rowNum being looked up, we go on to the next check. Otherwise, we have no conflict and return. The next check that we need to perform is against the visimap, to see if the tuple is visible. If yes, then we have a conflict. Since the snapshot used to perform uniqueness checks for AO/CO is SNAPSHOT_DIRTY (we currently don't support SNAPSHOT_SELF, used for CREATE UNIQUE INDEX CONCURRENTLY), it is possible to detect whether the block directory tuple (and by extension the data tuple) was inserted by a concurrent in-progress transaction. In this case, we simply skip the visimap check and return true. The benefit of performing the sysscan on the block directory is that HeapTupleSatisfiesDirty() is called, and in the process, the snapshot's xmin and/or xmax fields are updated (see SNAPSHOT_DIRTY for details on its special contract). Returning true in this situation will cause the unique index code's xwait mechanism to kick in (see _bt_check_unique()), and the current transaction will wait for the one that inserted the tuple to commit or abort.
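The decision flow above boils down to three cases, sketched here with hypothetical helper names (the real logic lives in the AO/CO tableam code):

```python
# Sketch of the AO unique-check decision flow:
#  1. no visible block directory entry covering rowNum -> no conflict;
#  2. covering entry inserted by a concurrent in-progress transaction
#     (seen through the dirty snapshot) -> report a conflict so the
#     caller's xwait mechanism waits on the inserter;
#  3. otherwise consult the visimap: a still-visible tuple is a genuine
#     conflict, a deleted one is not.

def unique_check_conflict(blockdir_entry, visimap_says_visible):
    if blockdir_entry is None:
        return False                 # no covering entry: no conflict
    if blockdir_entry.get("in_progress"):
        return True                  # trigger xwait on the concurrent inserter
    return visimap_says_visible      # deleted tuples don't conflict

assert unique_check_conflict(None, True) is False
assert unique_check_conflict({"in_progress": True}, False) is True
assert unique_check_conflict({"in_progress": False}, True) is True
```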
Tableam changes: Since there is a lot of overhead (leads to ~20x performance degradation in the worst case) in setting up and tearing down scan descriptors for AO/CO tables, we avoid the scanbegin..fetch..scanend construct in table_index_fetch_tuple_check().
So, a new tableam API index_unique_check() is used, which is implemented only for AO/CO tables. Here, we fetch a UniqueCheckDesc, which stores all the in-memory state to help us perform a unique index check. This descriptor is attached to the DMLState structs. Currently, the descriptor holds only a block directory struct. It will be modified later on to hold a visimap reference to help implement DELETEs/UPDATEs. Furthermore, we initialize this struct on the first unique index check performed, akin to how we initialize descriptors for insert and delete.
AO lazy VACUUM is different from heap vacuum in the sense that ctids of data tuples change (and the index tuples need to be updated as a consequence). It leverages the scan and insert code to scan live tuples from each segfile and to move (insert) them in a target segfile. While moving tuples, we need to avoid performing uniqueness checks from the insert machinery. This is to ensure that we avoid spurious conflicts between the moved tuple and the original tuple. We don't need to insert a placeholder row for the backend running vacuum as the old index entries will still point to the segment being compacted. This will be the case up until the index entries are bulk deleted, but by then the new index entries along with new block directory rows would already have been written and would be able to answer uniqueness checks.
Transaction isolation: Since uniqueness checks utilize the special dirty snapshot, these checks can cross transaction isolation boundaries. For instance, let us consider what happens if we are in a repeatable read transaction and we insert a key that was inserted by a concurrent transaction. Further, let's say that the repeatable read transaction's snapshot was taken before the concurrent transaction started. This means that the repeatable read transaction won't be able to see the conflicting key (e.g. with a SELECT). In spite of that, conflicts will still be detected. Depending on whether the concurrent transaction has committed or is still in progress, the repeatable read transaction will raise a conflict or enter xwait, respectively. This behavior is table AM agnostic.
Partial unique indexes: We don't have to do anything special for partial indexes. Keys not satisfying the partial index predicate are never inserted into the index, and hence there are no uniqueness checks triggered (see ExecInsertIndexTuples()). Also, during partial unique index builds, keys that don't satisfy the partial index predicate are never inserted into the index (see *_index_build_range_scan()).