Skip-Deltas in Subversion
=========================

To keep repositories at a manageable size, essentially all version
control systems use some kind of relative compression technique such
that two similar versions of the same file don't take up much more
space than just one version of that file. The two most common
techniques are the SCCS "weave", which represents all revisions of a
file as a single data stream with the moral equivalent of #ifdefs, and
the technique of storing deltas (differences) between related
revisions of files (see
http://web.mit.edu/ghudson/thoughts/file-versioning for details).
Subversion uses deltas.
Subversion uses a technique called "skip-deltas" to ensure that only a
reasonable number of deltas need to be composed in order to retrieve
any revision of a file. The concept of skip-deltas is inspired by
the concept of skip-lists, but the two aren't all that similar.

For the purposes of this document, we will pretend that revisions of a
file are numbered starting from 0. In reality, this number
corresponds to the "change count" field of a node-revision in each
filesystem back end.

Skip-Deltas in the FSFS Back End
================================

In the FSFS back end, each revision of a file is represented as a
delta against an older revision of the file. The first revision is
represented as a delta against the empty stream (i.e. it is
self-compressed). To reconstruct a revision of a file, the filesystem
code determines the chain of deltas leading back to revision 0,
composes them all together using the delta combiner, and applies the
resulting super-delta to the empty stream in order to reconstruct the
file contents.

The most obvious strategy would be to choose revision N-1 as the delta
base for revision N. But even with the delta combiner, it would
become very slow to retrieve revision 1000 of a file if we had to
piece together 1000 deltas. So, to choose the delta base for revision
N, we write out N in binary and flip the rightmost bit whose value is
1. For instance, if we are storing 54, we write it out in binary as
110110, flip the last '1' bit to get 110100, and thus pick revision 52
of the file as the delta base. A file with ten versions (numbered
0-9) would have those versions represented as follows:

  0 <- 1    2 <- 3    4 <- 5    6 <- 7
  0 <------ 2         4 <------ 6
  0 <---------------- 4
  0 <------------------------------------ 8 <- 9

where "0 <- 1" means that the delta base for revision 1 is revision 0.

Because we flip the rightmost '1' bit each time we pick a delta base,
at most lg(N) deltas are necessary to reconstruct revision N of a
file.
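
Flipping the rightmost '1' bit of N is the same as computing
N & (N - 1). A small Python sketch of the base choice and the
resulting delta chain (the function names are ours, not Subversion's):

```python
def delta_base(n):
    """Delta base for revision n > 0: clear the rightmost '1' bit.
    Revision 0 is deltified against the empty stream."""
    return n & (n - 1)

def delta_chain(n):
    """Revisions visited when reconstructing revision n: follow
    delta bases back to revision 0."""
    chain = [n]
    while n > 0:
        n = delta_base(n)
        chain.append(n)
    return chain

print(delta_base(54))   # 52  (110110 -> 110100)
print(delta_chain(54))  # [54, 52, 48, 32, 0] -- four deltas to compose
```

Each step clears one '1' bit, so the chain length is bounded by the
number of set bits in N, which is at most lg(N) + 1.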

Skip-Deltas in the BDB Back End
===============================

In the BDB back end, each revision of a file is represented as a delta
against a newer revision of the file--the opposite of FSFS. The
newest version of a file is stored in plain text. Whereas in FSFS, we
add a new version of a file simply by picking a delta base and writing
out a delta, in BDB the process is more complicated: we write out the
new version of the file in plain text; then, after the commit is
confirmed, we go back and "redeltify" one or more older versions of
the file against the new one.

The goal of the redeltification process is to produce the reverse of
the FSFS diagram:

  0 ------------------------------------> 8 -> 9
                      4 ----------------> 8
            2 ------> 4         6 ------> 8
       1 -> 2    3 -> 4    5 -> 6    7 -> 8

To accomplish this, we write out the number in binary, count the
number of trailing zeros, and redeltify that number of ancestor
revisions plus 1. For instance, when we see revision 8, we write it
out as "1000", note that there are three trailing 0s, and resolve to
redeltify four ancestor revisions: the revisions one back, two back,
four back, and eight back.
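
In other words, for a new revision n the scheme redeltifies ancestors
at distances 1, 2, 4, ..., 2^k back, where k is the number of trailing
zero bits of n. A Python sketch of this idealized rule (hypothetical
names, not Subversion's own code):

```python
def redeltify_targets(n):
    """Ancestors to redeltify against new revision n, under the
    idealized scheme: one per trailing zero bit of n, plus one,
    at distances 1, 2, 4, ... back."""
    trailing = (n & -n).bit_length() - 1  # count of trailing 0 bits
    return [n - (1 << i) for i in range(trailing + 1)]

print(redeltify_targets(8))  # [7, 6, 4, 0]: one, two, four, eight back
print(redeltify_targets(6))  # [5, 4]
```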

As it turns out, the above diagram is a fiction. To reduce overhead,
the BDB back end makes three compromises to the skip-delta scheme:

  * When storing file revisions 0-31, only the immediate predecessor
    is redeltified.

  * We don't redeltify the ancestor revision which is two back from
    the one we are storing.

  * We never redeltify revision 0 of a file.
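
A sketch of target selection with those three compromises applied
(again a hypothetical helper, not the actual BDB code):

```python
def bdb_redeltify_targets(n):
    """Ancestors redeltified against new revision n in BDB,
    with the three compromises applied."""
    if n < 32:
        # Below revision 32, only the immediate predecessor --
        # and revision 0 is never redeltified.
        return [n - 1] if n - 1 > 0 else []
    trailing = (n & -n).bit_length() - 1  # trailing 0 bits of n
    targets = [n - (1 << i) for i in range(trailing + 1)]
    # Skip the ancestor two back, and never revision 0.
    return [t for t in targets if t != n - 2 and t != 0]

print(bdb_redeltify_targets(9))   # [8]
print(bdb_redeltify_targets(32))  # [31, 28, 24, 16]
```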

Despite these compromises, the asymptotic behavior of the BDB
skip-delta scheme is the same as the simpler FSFS one: O(lg(N)) deltas
are necessary to reconstruct any revision of a file which has had N
revisions.

Skip-Deltas and Branches
========================

If a file's history diverges because it is copied and then modified on
both branches, the behavior is as follows:

  * In FSFS, we choose delta bases just as we would if each branch
    were an isolated linear path leading back to revision 0. For
    instance, if a file has six revisions (0-5), then branches into
    revisions 6-7 and revisions 6'-8', they would look like:

      0 <- 1    2 <- 3    4 <- 5    6 <- 7
      0 <------ 2         4 <------ 6
      0 <---------------- 4         4 <------ 6'   6' <- 7'
      0 <-------------------------------------- 8'

  * In BDB, we redeltify ancestor revisions just as we would if each
    branch were an isolated linear path leading back to revision 0.
    The result depends on the order of commits. If a file has four
    revisions (0-3), then branches into revisions 4 and 4', then if 4
    was committed first and 4' was committed second, the result would
    look like:

                              4
      0 --------------------> 4'
                2 ----------> 4'
           1 -> 2    3 -----> 4'

    but if instead, 4 was committed second, the result would look
    like:

                              4'
      0 --------------------> 4
                2 ----------> 4
           1 -> 2    3 -----> 4

    Although this order dependency may be confusing to think about,
    it causes no complexity in the code, and the O(lg(N)) asymptotic
    behavior is clearly preserved.

Note that in the BDB back end, a branched file has a separate
plain-text representation for each branch head, while in the FSFS back
end, that is not the case.

Costs of Skip-Deltas
====================

In most workloads, the delta for a file revision becomes larger if the
delta base is farther away--in terms of the diagrams, longer arrows
take up more space. In the worst case, where all changes to the file
are orthogonal to each other, a delta across N file revisions may be N
times as expensive as a delta across one revision.

In either back end, the average number of revisions crossed by a delta
arrow is O(lg(N)), if the file has had N revisions. So we may assume
that in the worst case, skip-deltas incur an O(lg(N)) space penalty
while providing an O(N/lg(N)) time benefit. The practical space
penalty appears to be substantially less than O(lg(N)), because many
files have short histories and many changes are not orthogonal to each
other.
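
The O(lg(N)) average arrow length is easy to check numerically for the
FSFS scheme, where revision n's arrow spans n minus its delta base,
i.e. the value of n's lowest set bit (a quick empirical sketch, not
taken from the Subversion sources):

```python
def average_arrow_length(n):
    """Average distance spanned by the FSFS delta arrows for
    revisions 1..n; rev - (rev & (rev - 1)) is rev's lowest set bit."""
    total = sum(rev - (rev & (rev - 1)) for rev in range(1, n + 1))
    return total / n

for k in (4, 8, 12):
    print(k, average_arrow_length(2 ** k))  # grows like k/2 + 1, i.e. O(lg N)
```

For N = 2^k revisions the average works out to exactly k/2 + 1,
consistent with the O(lg(N)) worst-case space penalty claimed above.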