This document explains the policy of performing a compaction. For details explaining how compactions are implemented, see compaction.md.

The compaction policy is responsible for selecting a set of rowsets to compact together. Compactions are necessary in order to reduce the number of DiskRowSets which must be consulted for various operations, thus improving the overall performance of the tablet.

Coming up with a good compaction policy is a balancing act between several goals:

Re-arrange the physical layout to be more efficient for subsequent operations.

Do so without using too many resources in the compaction itself.

Do so “smoothly” - spread work out over time so that operation performance is predictable and reasonably constant.

The following sections provide some analysis of the above goals:

In order to determine a good compaction policy, we want to define a cost measure for a given set of RowSets within a tablet. Consider the following set of RowSets:

1 2 3 4 5 |--A--||-B--||--C--||---D----| |--------------E-------------| |-F--|

In this diagram, the key space spans from left to right, and each RowSet is drawn as an interval based on its first and last contained key. We'll define a few terms for later use in this document:

**Width**

Let the Width of a RowSet be proportional to the percentage of key space that it spans. For example, rowset E has a width of 1, since it spans the whole tablet. Rowset B has width 0.2, since it spans about 20% of the tablet.

Note that the Width is also the probability that any read in a uniform random read workload will have to consult that RowSet.

**Height**

The “Height” of a tablet at a given key is the number of rowsets whose key ranges contain that key. For example, the height of the above tablet at key 1 is 2, since rowsets A and E span that key. The height at key 4 is 3, since D, E, and F span that key.

The Height at any key is the number of RowSets that will be have to be consulted for a random read of that key.

Let us consider the cost of various operations on the tablet:

In order to Insert, each of the rowsets must be checked for a duplicate key. By storing the rowset ranges in an interval tree, we can efficiently determine the set of rowsets whose intervals may contain the key to be inserted, and thus the cost is linear in that number of rowsets:

Let n = the Height of the tablet at the given key Let B = the bloom filter false positive rate Let C_bf = cost of bloom filter check Let C_pk = cost of a primary key lookup Cost = n*C_bf + n*B*C_pk Cost = n(C_bf + B*C_pk)

Typically, B is approximately 1% or lower, so the bloom filter checks dominate this equation. However, in some cases where the primary key column is very large, every primary key check will incur a disk seek, meaning that `C_pk`

is orders of magnitude higher than `C_bf`

(which we expect to be in RAM or SSD). So, we cannot fully ignore the term resulting from the bloom filter misses.

The costs for random read are similar to the cost for inserts: given the known key, each potentially overlapping rowset must be queried.

Scans cannot make use of bloom filters, so the cost is similar to the above, except that all overlapping rowsets must be seeked by PK:

Cost = n*C_pk

We assume a “short” scan is one in which the sequential IO cost after finding the start key is small compared to the seek cost. (eg assuming a 10ms seek time, 1MB or less of sequential IO).

A long scan is likely to retrieve data from many rowsets. In this case, the size of the rowsets comes into play.

Let S = the number of MB in the scan Let B = the disk bandwidth (MB/sec) Let n = the number of rowsets accessed, as before

Assume that accessing each rowset costs 1 seek (same as `C_pk`

).

Cost = n*C_pk + S/B

To summarize the above, all of the costs of operations are heavily dependent on the number of rowsets which must be accessed. Therefore, to minimize cost, we should follow the following strategies:

In the case of point queries (inserts and random read/short scan), merge rowsets which overlap in keyspace, thus reducing the average height of the Tablet.

In the case of longer scans, merge together rowsets to improve the ratio of sequential IO to seeks.

We can assume that, so long as the rowsets are reasonably large, goal #2 above has diminishing returns after rowsets achieve ~10MB or so of sequential IO for every seek (1 seek ~= 10ms, 10MB IO ~= 100ms). However, goal #1 has linear returns, so we focus on goal #1.

According to the above analysis, the optimal configuration for a tablet is a single giant rowset which spans the entirety of the key space. This is intuitively true: a fully-compacted tablet is going to perform the best because every access will require at most one bloom filter check and one seek.

However, it is obviously not optimal to simply compact all RowSets together in every compaction. This would be inefficient, since every compaction would rewrite the entire rowset, causing huge write amplification and wasted IO for only a small amount of efficiency gain.

So, we need to consider not just how efficient the resulting tablet would be, but also how expensive it is to perform the candidate compaction. Only by weighing those two against each other can we decide on the best compaction to perform at any given point in time.

For the purposes of this analysis, we consider the cost of a compaction to simply be the sum of the IO performed by the compaction. We'll assume that deletions are rare, in which case the output data size of a compaction is approximately equal to the input data size. We also assume that the compaction inputs are large enough that sequential IO outweighs any seeks required.

Thus the cost of performing a compaction is O(input size).

The third goal for compaction is to be able to perform work incrementally. Doing frequent incremental compactions rather than occasional large ones results in a more consistent performance profile for end-user applications. Incremental work also allows the system to react more quickly to changes in workload: for example, if one area of the keyspace becomes hot, we would like to be able to quickly react and compact that area of the keyspace within a short time window.

One way to achieve this goal is to put a bound on the amount of data that any given compaction will read and write. Bounding this data on the range of several hundred MB means that a compaction can occur in 10 seconds or less, allowing quick reaction time to shifts in workload.

The first key piece of the proposed compaction strategy is to limit the maximum size of any RowSet to a relatively small footprint - e.g 64MB or even less. This can be done by modifying the DiskRowSet writer code to “roll over” to a new rowset after the size threshold has been reached. Thus, even if flushing a larger dataset from memory, the on-disk rowset sizes can be limited.

For example, imagine that the max rowset size is set to 64MB, and 150MB of data has accumulated in the MemRowSet before a flush. The resulting output of the flush, then looks like:

A B C |------||------||--| 64MB 64MB 22MB

Note that even though the maximum DiskRowSet size is 64MB, the third flushed rowset will be smaller. In the future, we could esimate the on-disk data size and try to make the three RowSets approximately equal-sized, but it is not necessary for correctness.

Now imagine another scenario, where a Tablet flushes several times, each resulting in small files which span the entirety of the key space -- commonly seen in a uniform random insert load. After 3 flushes, the Tablet looks like:

A (50MB) |-------------------| B (50MB) |-------------------| C (50MB) |-------------------|

Because the three rowset ranges overlap, every access to the tablet must query each of the rowsets (i.e the average rowset “depth” is 3). If the compaction policy selects these three RowSets for compaction, the compaction result will look like:

D E F |------||------||--| 64MB 64MB 22MB

Essentially, the compaction reorganizes the data from overlapping rowsets into non-overlapping rowsets of a similar size. This reduces the average depth from 3 to 1, improving the Tablet performance.

With these limited sizes, a modestly sized Tablet (eg 20GB) will have on the order of hundreds of RowSets. In order to efficiently determine the set of RowSets which may contain a given query key or range, we have to change the Tablet code to store the RowSets in an interval tree instead of a simple list. The Interval Tree is a data structure which provides efficient query for the set of intervals overlapping a given query point or query interval.

As a simplification, assume for now that all RowSets are exactly the same size (rather than bounded under a maximum). Then, we can classify a RowSet as “good” or “bad” based on one simple factor: the smaller the range of key space that it spans, the better. Assuming a uniform insert workload, every flushed RowSet will span the entirety of the Tablet's key space -- and hence must be queried by every subsequent operation. Once there are multiple such flushed RowSets (A, B, and C in the diagram), compacting them results in skinnier rowsets D, E, and F.

Intuitively, then, a good compaction policy finds rowsets which are wide and overlapping, and compacts them together, resulting in rowsets which are skinny and non-overlapping.

Taking the cost factors developed above, we can look at compaction selection as an optimization problem: reduce the cost of the Tablet configuration as much as possible under a given IO budget.

Per the analysis above, the cost of a single read or insert is linear in the “height” of the RowSets at the key being accessed. So, the average cost of operations can be calculated by integrating the tablet height across the key space, or equivalently adding up the widths of all of the RowSets. For example:

|---A----| (width 10) |-----B-------| (width 15) |-C-||-----D-------| (width 5, width 15) |--------E---------| (width 20)

So, the summed width = 20+5+15+15+10 = 65.

Imagine that we choose to compact rowsets A, B, and D above, resulting in the following output:

|-C-||-F-||-G-||-H-| (width 5, width 5, width 5, width 5) |--------E---------| (width 20)

Note that the total number of bytes have not changed: we've just reorganized the bytes into a more compact form, reducing the average height of the tablet.

Now the summed cost is 40. So, the compaction had benefit 25, using a budget of 3 units of IO (remember that rowsets are assumed to be constant size for this analysis).

Another choice for the compaction might have been to compact B, D, and E, resulting in:

|---A----| (width 10) |-C-| (width 5) |---F--||--G--||-H-| (width 8, width 7, width 5)

This compaction reduced the tablet cost from 65 to 35 -- so its benefit was 30, using the same IO budget of 3.

Given that the second compaction choice reduced the tablet height more using the same budget, it is a more optimal solution.

The reduction of cost due to a compaction is simple to calculate:

Cost change = sum(original rowset widths) - sum(output rowset widths)

We know that the output rowsets will not overlap at all, and that their total width will span the union of the input rowset ranges. Therefore:

Cost change = sum(original rowset widths) - (union width of original rowsets)

Note that, for this analysis, the key ranges are treated as integers. This can be extended to string keys in a straightforward manner by treating the string data as unsigned integers.

Given budget N rowsets:

For each pair of rowsets (A, B): Evaluate BestForPair(A, B): BestForPair(A, B): Let union width = max(A.max_key, B.max_key) - min(A.min_key, B.min_key) Determine the subset R of rowsets that are fully contained within the range A, B Evaluate PickRowsetsWithBudget(R, N): Set objective = sum(rowset width) - union width If objective > best objective: best solution = this set PickRowsetsWithBudget(R, N): Choose the N rowsets in R which which maximize sum(rowset width)

PickRowsetsWithBudget can be solved by simply sorting the rowsets by their width and choosing the top N.

Even though we limit the maximum rowset size to a constant, some rowsets may be smaller due to more frequent flushes, etc. Thus, we would like to change the budget to be a number of MB of IO, rather than a simple count N of input files. The subproblem PickNRowSets then becomes:

Choose a set of RowSets such that their total file size falls within a budget, and maximizes their total widths.

This is an instance of the 0-1 knapsack problem, so we replace PickRowsetsWithBudget(R, N) with a knapsack problem solver.

The algorithm contains `O(n^2)`

calls to BestForPair, each of which contains one instance of the 0-1 knapsack problem, which has complexity `O(n * max_budget)`

. Thus, the total complexity is cubic in the number of rowsets, which can become quite expensive when a given tablet may include on the order of a thousand rowsets.

We can optimize the approach by changing the order in which we consider pairs (A, B) in the above-described algorithm:

For each rowset A: candidates = all rowsets B such that B.min_key >= A.min_key sort candidates B by increasing B.max For each pair (A, B): Evaluate BestForPair(A, B)

Considering the pairs in this order simplifies BestForPair as follows:

BestForPair(A, B): Let union width = max(A.max_key, b.max_key) - min(A.min_key, B.min_key) Determine the subset R of rowsets that are fully contained within the range A, B ** Because B.max_key is non_decreasing, this subset R is identical to R in the previous call, except that B is now added to the end. No extra loop is required. Evaluate PickRowsetsWithBudget(R, N): ** This instantiation of the knapsack problem now is identical to the previous instantiation, except with one additional item. Thus, it can be computed incrementally from the previous solution. Set objective = sum(rowset width) - union width If objective > best objective: best solution = this set

Additionally, upper bounds can be calculated by solving the simpler fractional knapsack problem and used to short-circuit the more complex calculations.

The above analysis is done in terms of constant workloads. However, in practice, workloads may be skewed. Given that, it is more important to compact the areas of the key space which are seeing frequent access. The algorithms can be extended in a straightforward way by changing all references to the “width” of a rowset to instead be CDF(max key) - CDF(min key) where CDF is the cumulative distribution function for accesses over a lagging time window.