For the purpose of this document, “maintenance operations” are any background processes that Kudu runs in the course of normal operation. The MaintenanceManager must schedule these operations intelligently to keep the system operating smoothly. Partly, this is a tradeoff between current performance and future performance. For example, running a compaction will spend some I/O now in order to speed up insertions later. Partly, this is a matter of performing necessary tasks that, if left undone, would compromise the stability of the system. For example, if we never flushed MemRowSets, we would eventually run out of memory. As memory gets low, admissions control will slow the pace of new requests getting accepted.
The most important things that we need to weigh in order to make good decisions are:
Some other criteria that we considered, but rejected for v1 include:
Free disk space should not be an issue in most competently administered setups. We may revisit this later, but for the initial version, it is best to assume we have enough space.
We can't consider disk-based scheduling right now since we don't have support for multiple disks yet.
Memory usage can be broken down into a few buckets:
We assume that #1 is relatively constant. The maintenance op scheduler can make tradeoffs between #2 and #3 by deciding to flush certain MemRowSets to disk.
We want to keep the total amount of memory held by #1, #2 and #3 from growing too large. For now, our goal is to keep this sum relatively constant. We have not yet implemented giving memory held by tcmalloc back to the operating system.
If we know that a tablet's workload is scan-heavy (rather than insert-heavy), we may wish to do a major delta compaction for that tablet to speed up scans. It's probably smarter to do compactions on tables that are heavily used than on obscure tables that don't see much traffic.
This is probably the most difficult information source to make effective use of, simply because it involves many workload-dependent assumptions and heuristics.
MemRowSet and DeltaMemRowSet objects must be flushed to disk when they get too old. If we don't do this, the write-ahead log (WAL) will grow without bound. This growth would waste disk space and slow startup to a crawl, since the entire WAL must be traversed during the startup process.
We should embed a WAL op id in each MemRowSet and DeltaMemRowSet. The scheduler will look more favorably on the flushing of a MemRowSet as it ages. After the operation id falls too far behind, it will try to flush the MemRowSet no matter what.
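The aging rule above can be sketched as a small scoring function. This is only an illustration under stated assumptions: the WAL op id is modeled as a single increasing integer, and `FlushAgePolicy`, `soft_lag`, `hard_lag`, and `FlushUrgency` are hypothetical names, not part of Kudu.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical thresholds, measured in WAL operations. Not Kudu settings.
struct FlushAgePolicy {
  int64_t soft_lag;  // lag at which age starts to add urgency
  int64_t hard_lag;  // lag at which the flush becomes mandatory
};

// Returns a flush urgency for a MemRowSet whose embedded WAL op id is
// `anchored_op_id`, given the newest WAL op id `latest_op_id`.
// 0.0 means age adds no urgency; infinity means "flush no matter what".
// Assumption: op ids are plain increasing integers.
inline double FlushUrgency(int64_t anchored_op_id, int64_t latest_op_id,
                           const FlushAgePolicy& policy) {
  assert(policy.soft_lag >= 0 && policy.hard_lag > policy.soft_lag);
  const int64_t lag = latest_op_id - anchored_op_id;
  if (lag >= policy.hard_lag) {
    return std::numeric_limits<double>::infinity();
  }
  if (lag <= policy.soft_lag) {
    return 0.0;
  }
  // Linear ramp from 0 to 1 between the soft and hard thresholds.
  return static_cast<double>(lag - policy.soft_lag) /
         static_cast<double>(policy.hard_lag - policy.soft_lag);
}
```

A linear ramp is just one possible shape. What matters is that urgency grows with age and that there is a hard cutoff past which the flush is forced.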
These operations spend some I/O or CPU in order to free up memory. They may also incur further performance costs after completion. These cannot be delayed indefinitely, as RAM is a finite resource.
Basically the same costs as MemStore flush
Additional benefits: TODO: flushing may also speed up scans substantially. Need to run experiments on this -- how much better is scanning a static cached file compared to scanning the equivalent memstore. Maybe an order of magnitude.
Cost: slower reads, slower inserts if evicting key columns or blooms
Benefit: frees RAM
These operations expend some kind of I/O and CPU now in order to improve the performance of the system after they complete. They are only ever “necessary” in that if we put them off forever, the system will slow to a crawl eventually.
Each tablet creates several MaintenanceOp objects, representing the various maintenance operations which can be performed on it. It registers these operations with the MaintenanceManager.
The MaintenanceManager has a main thread which periodically polls the registered MaintenanceOp objects and determines whether it should execute any of them. The default polling interval is 250 ms, but this is configurable. Access to the MaintenanceOp is assumed to be thread-safe. It's important to note that the scheduler can choose any op available to it. It is not bound to execute operations on a first-come, first-served basis.
If the MaintenanceManager decides to execute one of these operations, it will run it in a thread-pool of configurable size. We assume that maintenance operations are blocking and require a thread context. If the operation fails, the MaintenanceManager will log a warning message and re-trigger the main thread. The failed MaintenanceOp will not be retried until a configurable grace period has expired.
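One way to track the grace period for failed operations is sketched below. The class name `FailureBackoff` and its methods are hypothetical, and ops are identified by a string name purely for illustration. The sketch assumes it is only touched from the scheduler's main thread, so it does no locking.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <unordered_map>

// Hypothetical helper: remembers when each op last failed and refuses to
// reschedule it until a grace period has passed. Not thread-safe.
class FailureBackoff {
 public:
  using Clock = std::chrono::steady_clock;

  explicit FailureBackoff(std::chrono::milliseconds grace_period)
      : grace_period_(grace_period) {
    assert(grace_period.count() >= 0);
  }

  // Called when an op fails (the caller is expected to log the warning).
  void RecordFailure(const std::string& op_name, Clock::time_point now) {
    last_failure_[op_name] = now;
  }

  // True if the op has never failed or its grace period has expired.
  bool MayRetry(const std::string& op_name, Clock::time_point now) const {
    auto it = last_failure_.find(op_name);
    if (it == last_failure_.end()) return true;
    return now - it->second >= grace_period_;
  }

 private:
  std::chrono::milliseconds grace_period_;
  std::unordered_map<std::string, Clock::time_point> last_failure_;
};
```

Passing the current time in as a parameter, rather than reading the clock inside the helper, keeps the logic easy to test.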
The MaintenanceOp has various fields indicating how much memory it will probably free, how much CPU it will use, and so forth. It also has a field which marks it as not currently executable. For example, this may be used by some Ops that don't want multiple instances of themselves to run concurrently.
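As a rough illustration of how the scheduler might use these fields, the sketch below picks the next op from a polled snapshot. The struct `OpStats`, its field names, and the selection policy (prefer freed RAM under memory pressure, otherwise prefer the estimated performance gain) are assumptions made for this example, not the actual Kudu interface.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical snapshot of what an op reports when polled.
struct OpStats {
  std::string name;
  bool runnable;            // false while the op must not be started
  int64_t ram_freed_bytes;  // estimated memory released on completion
  double perf_improvement;  // estimated future speedup, arbitrary units
};

// Returns the index of the op to run next, or -1 if none is runnable.
// Registration order is ignored: every runnable op is a candidate.
// Assumed policy: under memory pressure, maximize freed RAM; otherwise
// maximize the estimated performance improvement.
inline int PickBestOp(const std::vector<OpStats>& ops, bool memory_pressure) {
  int best = -1;
  for (int i = 0; i < static_cast<int>(ops.size()); ++i) {
    const OpStats& op = ops[i];
    assert(op.ram_freed_bytes >= 0);
    if (!op.runnable) continue;  // e.g. an instance is already running
    if (best == -1) {
      best = i;
      continue;
    }
    const OpStats& cur = ops[best];
    const bool better = memory_pressure
                            ? op.ram_freed_bytes > cur.ram_freed_bytes
                            : op.perf_improvement > cur.perf_improvement;
    if (better) best = i;
  }
  return best;
}
```

An op that does not want concurrent instances of itself would report `runnable = false` while one instance is in flight, which removes it from consideration without unregistering it.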
We want to keep at least one thread free to run flush operations, so that we don't ever get into a situation where we need to free up memory, but all the maintenance op threads are working on compactions or other operations. Hopefully, most compactions will be reasonably short, so that we won't have to schedule long compactions differently than short ones.
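The reserved-thread idea can be sketched as a small admission check in front of the thread pool. `ThreadBudget` and `OpKind` are hypothetical names. The sketch assumes a pool of at least two threads and caps non-flush work at one less than the pool size, so there is always a thread that is either idle or already running a flush.

```cpp
#include <cassert>

enum class OpKind { kFlush, kOther };

// Hypothetical admission check. Not thread-safe; assumed to be called only
// from the scheduler's main thread.
class ThreadBudget {
 public:
  explicit ThreadBudget(int pool_size) : pool_size_(pool_size) {
    assert(pool_size >= 2);
  }

  // Tries to claim a thread for an op. Returns false if the op must wait.
  bool TryAcquire(OpKind kind) {
    if (busy_ == pool_size_) return false;
    if (kind == OpKind::kOther) {
      // Non-flush ops may never occupy the last thread.
      if (other_busy_ == pool_size_ - 1) return false;
      ++other_busy_;
    }
    ++busy_;
    return true;
  }

  // Returns the thread claimed by a finished op of the given kind.
  void Release(OpKind kind) {
    assert(busy_ > 0);
    if (kind == OpKind::kOther) {
      assert(other_busy_ > 0);
      --other_busy_;
    }
    --busy_;
  }

 private:
  int pool_size_;
  int busy_ = 0;        // threads running any op
  int other_busy_ = 0;  // threads running non-flush ops
};
```

With a pool of three threads, two compactions can run at once, a third compaction is refused, and a flush is still admitted.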