Compaction was moved out of the catalog module into a separate module to eliminate circular dependencies, as it requires components that may themselves depend on the catalog module. Please refer to the catalog module's README for more information about the catalog service and the update log.
During schema changes, the catalog update log stores incremental updates, and each update increases the catalog version. Over time, the log may grow to an enormous size. To address this, snapshotting was introduced to the `UpdateLog`: snapshotting replaces a range of incremental updates with a single snapshot.
However, different components can refer to a specific version of the catalog, and until they finish their work with this version, it cannot be truncated.
This module introduces the `CatalogCompactionRunner` component, which handles periodic catalog compaction and ensures that catalog versions are dropped only when they are no longer needed by any component in the cluster.
For simplicity, compaction is performed from a single node (the compaction coordinator), which is also the metastorage group leader. Therefore, when the metastorage group leader changes, the compaction coordinator changes as well.
The `ElectionListener` interface was introduced to listen for metastorage leader elections.
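As an illustration, below is a minimal sketch of how a node could track whether it is the compaction coordinator based on leader-election notifications. The `onLeaderElected` signature and the `CompactionCoordinatorTracker` class are assumptions made for this example, not the module's actual API.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicBoolean;

interface ElectionListener {
    /** Invoked when a new metastorage group leader is elected (assumed signature). */
    void onLeaderElected(UUID newLeaderId);
}

class CompactionCoordinatorTracker implements ElectionListener {
    private final UUID localNodeId;
    private final AtomicBoolean coordinator = new AtomicBoolean();

    CompactionCoordinatorTracker(UUID localNodeId) {
        this.localNodeId = localNodeId;
    }

    @Override
    public void onLeaderElected(UUID newLeaderId) {
        // The node that currently leads the metastorage group also drives catalog compaction.
        coordinator.set(localNodeId.equals(newLeaderId));
    }

    boolean isCoordinator() {
        return coordinator.get();
    }
}
```

Tying the coordinator role to the metastorage leader avoids a separate election protocol: whichever node the metastorage group already elected simply takes over compaction.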
The process is initiated by one of the following events:
* low watermark changed (every 5 minutes by default)

Catalog compaction consists of two main stages:
1. **Replicas update.** Updates all replication groups with a preset minimum begin time among all active read-write transactions in the cluster. After some time (see below for details), these timestamps are published and become available to the next stage.
2. **Compaction.** Using the timestamps published during the previous stage, the coordinator calculates the minimum required version of the catalog and performs compaction.
Publishing the timestamps can take a long time, and the success of compaction depends on more than just these timestamps, so the two stages run in parallel: the compaction stage uses the replicas update result calculated during one of the previous iterations. To minimize the number of network requests, both stages share a common request that collects timestamps from the entire cluster in a single round trip.
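The following sketch outlines one coordinator iteration under these constraints. All names (`ClusterClient`, `collectTimes`, and so on) are hypothetical and exist only to make the example self-contained; the point is that a single round trip feeds both stages, and compaction relies on timestamps published by earlier iterations.

```java
import java.util.concurrent.CompletableFuture;

class CompactionIterationSketch {
    // Hypothetical collaborators, present only to make the sketch self-contained.
    interface ClusterClient { CompletableFuture<ClusterTimes> collectTimes(); }
    interface ClusterTimes { long minActiveRwBeginTime(); long minPublishedTxTime(); }
    interface Replicas { void propagateMinTxTime(long timestamp); }
    interface Compactor { void compactUpTo(long timestamp); }

    void runIteration(ClusterClient cluster, Replicas replicas, Compactor compactor) {
        // One request per node collects both the minimum begin time of active RW transactions
        // and the minimum timestamp already published by the node's replication groups.
        CompletableFuture<ClusterTimes> times = cluster.collectTimes();

        times.thenAccept(t -> {
            // Stage 1: propagate the fresh minimum RW begin time to all replication groups.
            // It becomes visible to compaction only after the next checkpoint.
            replicas.propagateMinTxTime(t.minActiveRwBeginTime());

            // Stage 2: compact using timestamps that were published during previous iterations.
            compactor.compactUpTo(t.minPublishedTxTime());
        });
    }
}
```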
The compaction stage consists of the following steps:
* The timestamp written during the replicas update stage (`minTxTime`) is published (becomes available to the compaction process) only after a checkpoint happens (partition data is flushed to disk).
* Each node determines the minimum published timestamp (`minTxTime`) across its local replication groups.
* If `minTxTime` is not published yet, the current iteration of compaction is aborted.
* The coordinator calculates the minimum required catalog version using the minimum of the collected `minTxTime` and the current low watermark.
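The last two steps can be summarized with a small sketch of the version calculation. The `activeCatalogVersion` method and the `OptionalLong`-based publication check are assumptions made for illustration; they mirror the description above rather than the module's real API.

```java
import java.util.OptionalLong;

class RequiredVersionSketch {
    /** Hypothetical view of the catalog, used only for this example. */
    interface CatalogService {
        /** Returns the catalog version that was active at the given timestamp. */
        int activeCatalogVersion(long timestamp);
    }

    /**
     * Returns the catalog version up to which compaction may proceed,
     * or -1 if some replication group has not published its timestamp yet.
     */
    int minRequiredVersion(OptionalLong publishedMinTxTime, long lowWatermark, CatalogService catalog) {
        if (publishedMinTxTime.isEmpty()) {
            // minTxTime is not published yet: abort the current compaction iteration.
            return -1;
        }

        // Versions older than the one active at min(minTxTime, low watermark) are no longer needed.
        long boundary = Math.min(publishedMinTxTime.getAsLong(), lowWatermark);
        return catalog.activeCatalogVersion(boundary);
    }
}
```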