| <!-- |
| Licensed under the Apache License, Version 2.0 (the "License"); |
| you may not use this file except in compliance with the License. |
| You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| Compaction Policy |
| ============================================================ |
| |
| This document explains the policy used to select rowsets for compaction.
| For details on how compactions are implemented, see compaction.md.
| |
| The compaction policy is responsible for selecting a set of rowsets to compact |
| together. Compactions are necessary in order to reduce the number of DiskRowSets |
| which must be consulted for various operations, thus improving the overall |
| performance of the tablet. |
| |
| Coming up with a good compaction policy is a balancing act between several goals: |
| |
| 1. Re-arrange the physical layout to be more efficient for subsequent operations. |
| |
| 2. Do so without using too many resources in the compaction itself. |
| |
| 3. Do so "smoothly" - spread work out over time so that operation performance is |
| predictable and reasonably constant. |
| |
| |
| The following sections provide some analysis of the above goals: |
| |
| Benefit of compaction for subsequent operations |
| ============================================================ |
| |
| In order to determine a good compaction policy, we want to define a cost measure |
| for a given set of RowSets within a tablet. Consider the following set of |
| RowSets: |
| |
| ``` |
|  1      2     3      4      5
| |--A--||-B--||--C--||---D----|
| |--------------E-------------|
|                   |-F--|
| ``` |
| |
| In this diagram, the key space spans from left to right, and each RowSet is drawn |
| as an interval based on its first and last contained key. We'll define a few terms |
| for later use in this document: |
| |
| **Width** |
| |
| Let the Width of a RowSet be proportional to the percentage of key |
| space that it spans. For example, rowset E has a width of 1, since |
| it spans the whole tablet. Rowset B has width 0.2, since it spans |
| about 20% of the tablet. |
| |
| Note that the Width is also the probability that any read in a |
| uniform random read workload will have to consult that RowSet. |
| |
| **Height** |
| |
| The "Height" of a tablet at a given key is the number of rowsets |
| whose key ranges contain that key. For example, the height of the |
| above tablet at key 1 is 2, since rowsets A and E span that key. |
| The height at key 4 is 3, since D, E, and F span that key. |
| |
| The Height at any key is the number of RowSets that will have to
| be consulted for a random read of that key.
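|
| To make these definitions concrete, here is a minimal sketch (Python, not
| Kudu code) that computes Width and Height for the example above; the
| interval endpoints are approximate positions read off the diagram:
|
| ```
| # Key space normalized to [0, 1]; each rowset is an inclusive
| # (min_key, max_key) interval. Endpoints are eyeballed from the diagram.
| rowsets = {
|     "A": (0.00, 0.25), "B": (0.25, 0.45), "C": (0.45, 0.65),
|     "D": (0.65, 1.00), "E": (0.00, 1.00), "F": (0.55, 0.75),
| }
|
| def width(interval):
|     lo, hi = interval
|     return hi - lo
|
| def height(rowsets, key):
|     # Number of rowsets whose key range contains `key`.
|     return sum(1 for lo, hi in rowsets.values() if lo <= key <= hi)
|
| print(width(rowsets["E"]))   # 1.0: E spans the whole tablet
| print(height(rowsets, 0.0))  # 2: A and E (key 1 in the diagram)
| print(height(rowsets, 0.7))  # 3: D, E, and F (key 4 in the diagram)
| ```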
| |
| Let us consider the cost of various operations on the tablet: |
| |
| Insert |
| ------- |
| In order to Insert, each of the rowsets must be checked for a duplicate key. By |
| storing the rowset ranges in an interval tree, we can efficiently determine the |
| set of rowsets whose intervals may contain the key to be inserted, and thus the |
| cost is linear in that number of rowsets: |
| |
| ``` |
| Let n = the Height of the tablet at the given key |
| Let B = the bloom filter false positive rate |
| Let C_bf = cost of bloom filter check |
| Let C_pk = cost of a primary key lookup |
| Cost = n*C_bf + n*B*C_pk |
| Cost = n(C_bf + B*C_pk) |
| ``` |
| |
| Typically, B is approximately 1% or lower, so the bloom filter checks dominate this |
| equation. However, in some cases where the primary key column is very large, every |
| primary key check will incur a disk seek, meaning that `C_pk` is orders of magnitude |
| higher than `C_bf` (which we expect to be in RAM or SSD). So, we cannot fully ignore |
| the term resulting from the bloom filter misses. |
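|
| As a concrete illustration, a small sketch of this cost model (the constants
| are illustrative assumptions, not measured values):
|
| ```
| def insert_cost(n, c_bf=0.1, c_pk=10.0, b=0.01):
|     """Cost = n*(C_bf + B*C_pk).
|
|     n:    tablet height at the insert key
|     c_bf: assumed bloom filter check cost (ms, RAM or SSD resident)
|     c_pk: assumed primary key lookup cost (ms, one disk seek)
|     b:    bloom filter false positive rate
|     """
|     return n * (c_bf + b * c_pk)
|
| # Even with B = 1%, the seek term contributes as much as the bloom
| # checks when C_pk is 100x C_bf:
| print(insert_cost(n=5))  # 5 * (0.1 + 0.01*10.0) = 1.0
| ```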
| |
| Random read |
| ------------ |
| The cost of a random read is similar to the cost of an insert: given the known key,
| each potentially overlapping rowset must be queried.
| |
| |
| Short scan
| -----------
| Scans cannot make use of bloom filters, so the cost is similar to the above, except
| that every overlapping rowset requires a primary key seek:
| |
| ``` |
| Cost = n*C_pk |
| ``` |
| |
| We assume a "short" scan is one in which the sequential IO cost after finding the start
| key is small compared to the seek cost (e.g., assuming a 10ms seek time, 1MB or less of
| sequential IO).
| |
| |
| Long scan (e.g. full table scan)
| ---------------------------------
| A long scan is likely to retrieve data from many rowsets. In this case, the size |
| of the rowsets comes into play. |
| |
| Assume that accessing each rowset costs one seek (the same cost as `C_pk`). Then:
|
| ```
| Let S = the number of MB in the scan
| Let B = the disk bandwidth (MB/sec)
| Let n = the number of rowsets accessed, as before
| Cost = n*C_pk + S/B
| ```
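|
| A sketch of this model with assumed constants; the "short" scan above is the
| limiting case where the `S/B` term is negligible next to the seeks:
|
| ```
| def scan_cost_ms(n, scan_mb, c_pk_ms=10.0, bandwidth_mb_s=100.0):
|     """Cost = n*C_pk + S/B, with illustrative (assumed) constants."""
|     seek_ms = n * c_pk_ms  # one seek per rowset accessed
|     transfer_ms = scan_mb / bandwidth_mb_s * 1000.0
|     return seek_ms + transfer_ms
|
| print(scan_cost_ms(n=10, scan_mb=1))     # 110.0: seeks dominate (short scan)
| print(scan_cost_ms(n=10, scan_mb=1000))  # 10100.0: sequential IO dominates
| ```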
| |
| To summarize the above, the cost of every operation depends heavily on the
| number of rowsets which must be accessed. Therefore, to minimize cost, we should
| pursue the following strategies:
| |
| 1. In the case of point queries (inserts and random read/short scan), merge |
| rowsets which overlap in keyspace, thus reducing the average height of the |
| Tablet. |
| |
| 2. In the case of longer scans, merge together rowsets to improve the ratio of |
| sequential IO to seeks. |
| |
| We can assume that, so long as the rowsets are reasonably large, goal #2 above has |
| diminishing returns after rowsets achieve ~10MB or so of sequential IO for every |
| seek (1 seek ~= 10ms, 10MB IO ~= 100ms). However, goal #1 has linear returns, so we |
| focus on goal #1. |
| |
| |
| Cost of doing a compaction |
| ============================================================ |
| According to the above analysis, the optimal configuration for a tablet is a |
| single giant rowset which spans the entirety of the key space. This is |
| intuitively true: a fully-compacted tablet is going to perform the best because |
| every access will require at most one bloom filter check and one seek. |
| |
| However, it is obviously not optimal to simply compact all RowSets together in every |
| compaction. This would be inefficient, since every compaction would rewrite the |
| entire rowset, causing huge write amplification and wasted IO for only a small |
| amount of efficiency gain. |
| |
| So, we need to consider not just how efficient the resulting tablet would be, but also |
| how expensive it is to perform the candidate compaction. Only by weighing those two |
| against each other can we decide on the best compaction to perform at any given point |
| in time. |
| |
| For the purposes of this analysis, we consider the cost of a compaction to simply be |
| the sum of the IO performed by the compaction. We'll assume that deletions are rare, |
| in which case the output data size of a compaction is approximately equal to the |
| input data size. We also assume that the compaction inputs are large enough that |
| sequential IO outweighs any seeks required. |
| |
| Thus the cost of performing a compaction is O(input size). |
| |
| |
| Incremental work |
| ============================================================ |
| The third goal for compaction is to be able to perform work incrementally. Doing |
| frequent incremental compactions rather than occasional large ones results in a |
| more consistent performance profile for end-user applications. Incremental work |
| also allows the system to react more quickly to changes in workload: for example, |
| if one area of the keyspace becomes hot, we would like to be able to quickly |
| react and compact that area of the keyspace within a short time window. |
| |
| One way to achieve this goal is to put a bound on the amount of data that any
| given compaction will read and write. Bounding this data to a few hundred MB
| means that a compaction can complete in ten seconds or less, allowing
| quick reaction to shifts in workload.
| |
| |
| Proposed strategy
| ============================================================ |
| |
| Limiting RowSet Sizes |
| ------------------------------ |
| The first key piece of the proposed compaction strategy is to limit the maximum size of
| any RowSet to a relatively small footprint - e.g. 64MB or even less. This can be done
| by modifying the DiskRowSet writer code to "roll over" to a new rowset after the size
| threshold has been reached. Thus, even when flushing a larger dataset from memory, the
| on-disk rowset sizes can be limited.
| |
| |
| Flushes with limited RowSet size |
| --------------------------------- |
| For example, imagine that the max rowset size is set to 64MB, and 150MB of data has
| accumulated in the MemRowSet before a flush. The resulting output of the flush then
| looks like:
| |
| ``` |
|    A       B     C
| |------||------||--|
|   64MB    64MB  22MB
| ``` |
| |
| Note that even though the maximum DiskRowSet size is 64MB, the third flushed rowset
| will be smaller. In the future, we could estimate the on-disk data size and try to make
| the three RowSets approximately equal-sized, but it is not necessary for correctness.
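|
| A minimal sketch of the roll-over behavior (the class name and byte
| accounting below are hypothetical, not Kudu's actual DiskRowSet writer API):
|
| ```
| MAX_ROWSET_BYTES = 64 * 1024 * 1024
|
| class RollingRowSetWriter:
|     """Starts a new output rowset whenever the current one reaches the
|     configured size threshold."""
|
|     def __init__(self, max_bytes=MAX_ROWSET_BYTES):
|         self.max_bytes = max_bytes
|         self.rowsets = []  # finished rowsets, each a list of rows
|         self.current, self.current_bytes = [], 0
|
|     def append(self, row, row_bytes):
|         # Roll over before exceeding the threshold (never emit empty rowsets).
|         if self.current and self.current_bytes + row_bytes > self.max_bytes:
|             self._roll()
|         self.current.append(row)
|         self.current_bytes += row_bytes
|
|     def _roll(self):
|         self.rowsets.append(self.current)
|         self.current, self.current_bytes = [], 0
|
|     def finish(self):
|         if self.current:
|             self._roll()
|         return self.rowsets
| ```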
| |
| Compactions with limited RowSet size |
| ------------------------------------- |
| Now imagine another scenario, where a Tablet flushes several times, each resulting in |
| small files which span the entirety of the key space -- commonly seen in a uniform |
| random insert load. After 3 flushes, the Tablet looks like: |
| |
| ``` |
| A (50MB) |
| |-------------------| |
| B (50MB) |
| |-------------------| |
| C (50MB) |
| |-------------------| |
| ``` |
| |
| Because the three rowset ranges overlap, every access to the tablet must query each of the
| rowsets (i.e., the average height of the tablet is 3). If the compaction policy selects these
| three RowSets for compaction, the compaction result will look like:
| |
| ``` |
|    D       E     F
| |------||------||--|
|   64MB    64MB  22MB
| ``` |
| |
| Essentially, the compaction reorganizes the data from overlapping rowsets into non-overlapping
| rowsets of a similar size. This reduces the average height from 3 to 1, improving the
| Tablet's performance.
| |
| |
| Dealing with large numbers of RowSets |
| -------------------------------------- |
| With these limited sizes, a modestly sized Tablet (e.g. 20GB) will have on the order of hundreds
| of RowSets. In order to efficiently determine the set of RowSets which may contain a given
| query key or range, we have to change the Tablet code to store the RowSets in an interval
| tree instead of a simple list. The interval tree is a data structure which provides efficient
| queries for the set of intervals overlapping a given query point or query interval.
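|
| As an illustration of the data structure (not Kudu's implementation), a
| compact centered interval tree supporting point ("stabbing") queries:
|
| ```
| class IntervalTree:
|     """Stores (min_key, max_key, rowset_id) intervals. stab(key) returns
|     the intervals containing key, in O(log n + k) for balanced inputs."""
|
|     def __init__(self, intervals):
|         points = sorted(p for lo, hi, _ in intervals for p in (lo, hi))
|         self.center = points[len(points) // 2]
|         containing, left, right = [], [], []
|         for iv in intervals:
|             lo, hi, _ = iv
|             if hi < self.center:
|                 left.append(iv)
|             elif lo > self.center:
|                 right.append(iv)
|             else:
|                 containing.append(iv)  # interval spans the center point
|         self.containing = containing
|         self.left = IntervalTree(left) if left else None
|         self.right = IntervalTree(right) if right else None
|
|     def stab(self, key):
|         hits = [iv for iv in self.containing if iv[0] <= key <= iv[1]]
|         child = self.left if key < self.center else self.right
|         if key != self.center and child is not None:
|             hits.extend(child.stab(key))
|         return hits
|
| # Intervals loosely based on the first diagram in this document:
| tree = IntervalTree([(0, 25, "A"), (25, 45, "B"), (45, 65, "C"),
|                      (65, 100, "D"), (0, 100, "E"), (55, 75, "F")])
| print(sorted(iv[2] for iv in tree.stab(70)))  # ['D', 'E', 'F']
| ```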
| |
| |
| Intuition behind compaction selection policy |
| --------------------------------------------- |
| As a simplification, assume for now that all RowSets are exactly the same size (rather |
| than bounded under a maximum). Then, we can classify a RowSet as "good" or "bad" based on |
| one simple factor: the smaller the range of key space that it spans, the better. |
| Assuming a uniform insert workload, every flushed RowSet will span the entirety of the |
| Tablet's key space -- and hence must be queried by every subsequent operation. Once there |
| are multiple such flushed RowSets (A, B, and C in the diagram), compacting them results in |
| skinnier rowsets D, E, and F. |
| |
| Intuitively, then, a good compaction policy finds rowsets which are wide and overlapping, and |
| compacts them together, resulting in rowsets which are skinny and non-overlapping. |
| |
| Taking the cost factors developed above, we can look at compaction selection as an optimization |
| problem: reduce the cost of the Tablet configuration as much as possible under a given IO budget. |
| |
| Per the analysis above, the cost of a single read or insert is linear in the "height" of the |
| RowSets at the key being accessed. So, the average cost of operations can be calculated by |
| integrating the tablet height across the key space, or equivalently adding up the widths |
| of all of the RowSets. For example: |
| |
| ``` |
|      |---A----|      (width 10)
|      |-----B-------| (width 15)
| |-C-||-----D-------| (width 5, width 15)
| |--------E---------| (width 20)
| ``` |
| |
| So, the summed width = 20+5+15+15+10 = 65. |
| |
| Imagine that we choose to compact rowsets A, B, and D above, resulting in the following |
| output: |
| |
| ``` |
| |-C-||-F-||-G-||-H-| (width 5, width 5, width 5, width 5) |
| |--------E---------| (width 20) |
| ``` |
| |
| Note that the total number of bytes has not changed: we've just reorganized the bytes
| into a more compact form, reducing the average height of the tablet.
| |
| Now the summed cost is 40. So, the compaction had benefit 25, using a budget of 3 units of IO |
| (remember that rowsets are assumed to be constant size for this analysis). |
| |
| Another choice for the compaction might have been to compact B, D, and E, resulting in: |
| |
| ``` |
|      |---A----|      (width 10)
| |-C-|                (width 5)
| |---F--||--G--||-H-| (width 8, width 7, width 5)
| ``` |
| |
| This compaction reduced the tablet cost from 65 to 35 -- so its benefit was 30, using the same |
| IO budget of 3. |
| |
| Given that the second compaction choice reduced the tablet height more using the same budget, |
| it is a more optimal solution. |
| |
| Mathematical analysis |
| ----------------------- |
| The reduction in cost due to a compaction is simple to calculate:
|
| ```
| Cost change = sum(original rowset widths) - sum(output rowset widths)
| ```
|
| We know that the output rowsets will not overlap at all, and that their total width will
| span the union of the input rowset ranges. Therefore:
|
| ```
| Cost change = sum(original rowset widths) - (union width of original rowsets)
| ```
| |
| Note that, for this analysis, the key ranges are treated as integers. This can be extended |
| to string keys in a straightforward manner by treating the string data as unsigned integers. |
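|
| A sketch of this calculation over integer key ranges, using the coordinates
| from the worked example above:
|
| ```
| def compaction_benefit(inputs):
|     """Cost change = sum(input widths) - union width of the inputs.
|
|     inputs: list of (min_key, max_key) integer ranges. The compaction
|     merges them into non-overlapping outputs spanning their union."""
|     sum_widths = sum(hi - lo for lo, hi in inputs)
|     union_width = max(hi for _, hi in inputs) - min(lo for lo, _ in inputs)
|     return sum_widths - union_width
|
| # Compacting B, D, and E from the example (widths 15, 15, 20; union 20):
| print(compaction_benefit([(5, 20), (5, 20), (0, 20)]))  # 30
| ```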
| |
| Algorithm |
| ---------- |
| |
| Given a budget of N rowsets:
| |
| ``` |
| For each pair of rowsets (A, B):
|   Evaluate BestForPair(A, B)
|
| BestForPair(A, B):
|   Let union width = max(A.max_key, B.max_key) - min(A.min_key, B.min_key)
|   Determine the subset R of rowsets that are fully contained within the range A, B
|   Evaluate PickRowsetsWithBudget(R, N):
|     Set objective = sum(rowset width) - union width
|     If objective > best objective:
|       best solution = this set
|
| PickRowsetsWithBudget(R, N):
|   Choose the N rowsets in R which maximize sum(rowset width)
| ``` |
| |
| |
| PickRowsetsWithBudget can be solved by simply sorting the rowsets by their width and |
| choosing the top N. |
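|
| Under the equal-size assumption, a minimal sketch of that greedy selection:
|
| ```
| def pick_rowsets_with_budget(rowsets, n):
|     """rowsets: list of (width, rowset_id) pairs; returns the N widest."""
|     return sorted(rowsets, key=lambda r: r[0], reverse=True)[:n]
| ```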
| |
| |
| Extending algorithm to non-constant sizes |
| ------------------------------------------ |
| |
| Even though we limit the maximum rowset size to a constant, some rowsets may be smaller |
| due to more frequent flushes, etc. Thus, we would like to change the budget to be a number |
| of MB of IO, rather than a simple count N of input files. The subproblem PickRowsetsWithBudget then becomes:
| |
| > Choose a set of RowSets such that their total file size falls within a budget, and |
| > maximizes their total widths. |
| |
| This is an instance of the 0-1 knapsack problem, so we replace PickRowsetsWithBudget(R, N) |
| with a knapsack problem solver. |
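|
| A sketch of that formulation as the standard 0-1 knapsack dynamic program,
| with rowset sizes as weights, widths as values, and the IO budget as the
| capacity; this illustrates the subproblem, not Kudu's actual solver:
|
| ```
| def pick_rowsets_knapsack(rowsets, budget_mb):
|     """rowsets: list of (size_mb, width) with integer sizes.
|
|     Returns (best total width, chosen rowsets) within the IO budget."""
|     best = [(0.0, [])] * (budget_mb + 1)  # best[b]: best solution using <= b MB
|     for size, width in rowsets:
|         # Iterate budgets downward so each rowset is used at most once (0-1).
|         for b in range(budget_mb, size - 1, -1):
|             cand = best[b - size][0] + width
|             if cand > best[b][0]:
|                 best[b] = (cand, best[b - size][1] + [(size, width)])
|     return best[budget_mb]
|
| # Three 50MB rowsets of width 1.0 each, under a 128MB budget: only two fit.
| print(pick_rowsets_knapsack([(50, 1.0), (50, 1.0), (50, 1.0)], 128))
| ```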
| |
| Computational complexity |
| ---------------------------- |
| |
| The algorithm contains `O(n^2)` calls to BestForPair, each of which contains one instance of the
| 0-1 knapsack problem with complexity `O(n * max_budget)`. Thus, the total complexity is cubic
| in the number of rowsets, which becomes quite expensive when a tablet may include on the
| order of a thousand rowsets.
| |
| We can optimize the approach by changing the order in which we consider pairs (A, B) in the |
| above-described algorithm: |
| |
| ``` |
| For each rowset A:
|   candidates = all rowsets B such that B.min_key >= A.min_key
|   sort candidates by increasing B.max_key
|   For each pair (A, B):
|     Evaluate BestForPair(A, B)
| ``` |
| |
| Considering the pairs in this order simplifies BestForPair as follows: |
| |
| ``` |
| BestForPair(A, B):
|   Let union width = max(A.max_key, B.max_key) - min(A.min_key, B.min_key)
|   Determine the subset R of rowsets that are fully contained within the range A, B
|     ** Because B.max_key is non-decreasing, this subset R is identical to R in the
|        previous call, except that B is now added to the end. No extra loop
|        is required.
|   Evaluate PickRowsetsWithBudget(R, N):
|     ** This instantiation of the knapsack problem is now identical to the previous
|        instantiation, except with one additional item. Thus, it can be computed
|        incrementally from the previous solution.
|     Set objective = sum(rowset width) - union width
|     If objective > best objective:
|       best solution = this set
| ``` |
| |
| |
| Additionally, upper bounds can be calculated by solving the simpler fractional knapsack |
| problem and used to short-circuit the more complex calculations. |
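|
| A sketch of that bound: sort by width per MB and fill the budget greedily,
| taking a fraction of the last item. The 0-1 optimum can never exceed this
| value, so a candidate whose bound is below the best solution found so far
| can be skipped without running the full dynamic program:
|
| ```
| def fractional_upper_bound(rowsets, budget_mb):
|     """rowsets: list of (size_mb, width). Upper-bounds the 0-1 optimum."""
|     bound, remaining = 0.0, budget_mb
|     # Greedy by "value density": width gained per MB of IO spent.
|     for size, width in sorted(rowsets, key=lambda r: r[1] / r[0], reverse=True):
|         take = min(size, remaining)
|         bound += width * (take / size)
|         remaining -= take
|         if remaining == 0:
|             break
|     return bound
| ```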
| |
| |
| Extending algorithm to non-uniform workloads |
| -------------------------------------------- |
| |
| The above analysis assumes a uniform access workload. In practice, workloads
| may be skewed. Given that, it is more important to compact the areas of the key space which
| see frequent access. The algorithms can be extended in a straightforward way by changing
| all references to the "width" of a rowset to instead be CDF(max key) - CDF(min key) where CDF |
| is the cumulative distribution function for accesses over a lagging time window. |
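|
| A sketch of this workload-aware width, using a hypothetical sample of
| recently accessed keys as the source of the empirical CDF:
|
| ```
| import bisect
|
| def make_cdf(sampled_access_keys):
|     """Empirical CDF over a lagging sample of accessed keys (a stand-in
|     for whatever access statistics the server actually maintains)."""
|     keys = sorted(sampled_access_keys)
|     return lambda k: bisect.bisect_right(keys, k) / len(keys)
|
| def workload_width(rowset, cdf):
|     lo, hi = rowset
|     return cdf(hi) - cdf(lo)
|
| # 90% of accesses hit keys below 100, so a small hot rowset now counts as
| # "wider" than a large cold one:
| cdf = make_cdf([5, 12, 37, 41, 58, 63, 77, 80, 95, 900])
| print(workload_width((0, 100), cdf))     # 0.9
| print(workload_width((100, 1000), cdf))  # 0.1
| ```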