| Deltified Storage |
| ----------------- |
| |
| Mike and I are reworking the deltification/undeltification code right |
| now, to be a lot more efficient than the first implementation. We're |
| using Branko's proposed scheme; below is his description of it, plus |
| some related correspondence. |
| |
| ========================================================================= |
| Branko's original mail, edited for space but no content changed: |
| ========================================================================= |
| |
| Date: Fri, 27 Jul 2001 06:11:04 +0200 |
| From: Branko =?ISO-8859-2?Q?=C8ibej?= <brane@xbc.nu> |
| Content-Type: text/plain; charset=ISO-8859-2; format=flowed |
| Subject: RFC: Delta indexing and composition |
| |
| Here's the "more on that later" I promised. |
| |
| Now that I've safely postponed the delta combiner, I'd like to share my |
| ideas for improving the deltification code in the filesystem. |
| |
| The scheme we have in place now works (kudos to Mike and Karl here!), |
| but wastes both time and space, and is probably a mild disaster for |
| non-sequential access to large files or old versions. The reasons for |
| this are obvious: |
| |
| 1. It reads/resurrects the whole text from the beginning up to the |
| end of the interesting region |
| 2. It keeps applying all the deltas from the version to the fulltext |
| every time an old version is accessed, throwing away the result. |
| |
| My suggested solution to these two problems is based on recognizing |
| exactly what a delta window is: |
| |
| A delta window defines a contiguous region of text A. It may depend |
| on a contiguous region of text B, but is independent of any other |
| delta window in the delta representation of A. |
| |
| Given this definition, two things become obvious: |
| |
| 1. To reconstruct the region defined by window N, we only have to |
| read that window. |
| 2. Different windows within a delta may depend on different source texts. |
| |
| The first item implies that we can solve the first problem my indexing |
| the windows within a delta, and accessing them directly. But we knew |
| that already. |
| |
| The second item implies that every time we reconstruct a region of text, |
| we can replace its defining delta window with a single diff from the |
| fulltext, eliminating the intermediate reconstruction steps the next |
| time this region is accessed -- thus solving the second problem. |
| |
| So, here's my proposal |
| |
| 1) Change the delta representation to index and store delta windows |
| separately |
| |
| DELTA ::= (("delta" FLAG ...) (OFFSET WINDOW) ...) ; |
| WINDOW ::= DIFF SIZE CHECKSUM [REP-KEY REP-OFFSET] ; |
| OFFSET ::= number ; |
| REP-OFFSET ::= number; |
| |
| |
| The REP-KEY and REP-OFFSET in WINDOW are optional because, if the |
| differences between two file revisions is large enough, the diff could |
| in fact be larger than a compression-only vdelta of the text region. In |
| that case it makes more sense to compress the window than to store a diff. |
| |
| 2) Change the undeltifier to use the new structure |
| |
| The undeltifier will stay essentially the same as it is now, except that |
| it will use OFFSET and REP-OFFSET to access the necessary bits directly. |
| The place where the delta combiner will fit stays the same, too. |
| |
| The major addition comes after the text is reconstructed. Using some |
| suitable heuristic -- probably based on the number of jumps from the |
| representation to the fulltext, the size of the diff from the fulltext, |
| etc. -- we can decide to: a) replace the window with a single diff from |
| the fulltext, b) replace it with a compressed version of the region, or |
| c) do nothing. |
| |
| The disadvantage of this proposal is, of course, more space used in the |
| repository. We can reduce the increase somewhat by compressing the |
| window index, and possibly improving the svndiff encoding. But I think |
| it's a fair price to pay, because this scheme reduces the number of disk |
| accesses, total memory use and average processing time needed to |
| reconstruct a region of text. |
| |
| That's it. Let's see you find holes in my proposal. :-) |
| |
| Brane |
| |
| ========================================================================= |
| Then Mike and I asked Branko some questions, here is his response: |
| ========================================================================= |
| |
| From: Branko =?ISO-8859-2?Q?=C8ibej?= <brane@xbc.nu> |
| Subject: Re: deltification semi-rewrite starting now |
| To: kfogel@collab.net |
| CC: dev@subversion.tigris.org |
| Date: Mon, 24 Sep 2001 23:45:48 +0200 |
| |
| kfogel@collab.net wrote: |
| >Mike Pilato and I have just reviewed Branko's deltification proposal, |
| >found at |
| > |
| > notes/delta-indexing-and-composition.txt |
| > |
| >and like what we see :-). We have a couple of questions that probably |
| >Branko can answer quickly, but basically we're going to start |
| >implementing it now, completion anticipated in 2 weeks max (thank |
| >goodness all the strings/reps separation is already done, so that |
| >whole wheel doesn't need to be reinvented). |
| > |
| >The plan is that we'll also implement a new `svnadmin' subcommand for |
| >deltifying and undeltifying revisions, or particular paths within |
| >revisions. That way, administrators have a way to make certain trees |
| >very efficient to retrieve -- for example, one might want to do this |
| >to a tagged release -- and also gives us an obvious way to deltify the |
| >storage of the current svn repository without perturbing the revision |
| >numbers. :-) |
| > |
| >Branko, a couple of questions regarding your lovely design: |
| > |
| >>So, here's my proposal |
| >> |
| >>1) Change the delta representation to index and store delta windows |
| >>separately |
| >> |
| >> DELTA ::= (("delta" FLAG ...) (OFFSET WINDOW) ...) ; |
| >> WINDOW ::= DIFF SIZE CHECKSUM [REP-KEY REP-OFFSET] ; |
| >> OFFSET ::= number ; |
| >> REP-OFFSET ::= number; |
| >> |
| >> |
| >>The REP-KEY and REP-OFFSET in WINDOW are optional because, if the |
| >>differences between two file revisions is large enough, the diff could |
| >>in fact be larger than a compression-only vdelta of the text region. In |
| >>that case it makes more sense to compress the window than to store a diff. |
| >> |
| > |
| >We're not sure what REP-OFFSET is for. |
| > |
| >We're pretty sure we understand OFFSET. It's the offset into the |
| >reconstructed fulltext. The OFFSETs increase with each WINDOW in a |
| >DELTA, and you can tell a given window's reconstruction range either |
| >by adding OFFSET + SIZE, or by subtracting one OFFSET from the next. |
| > |
| >Hopefully that's a correct summary. :-) |
| |
| Yes, that is exactly right. |
| |
| >But what is REP-OFFSET? We understand the REP-KEY that precedes it. |
| >That's simply the representation against whose fulltext this delta |
| >applies, right? |
| |
| Let me think ... Yes. |
| |
| > But why would we want an offset into that rep? We |
| >had thought the relevant offset(s) are part of the svndiff encoding. |
| >Is it a way of magically jumping over a certain number of windows and |
| >landing on the right one, in next-most-immediate source |
| >representation, or is it something else? |
| |
| Although the offset is implicit in the svndiff, in real life you want to |
| find the source (fulltext) *before* decoding the window. Also, as I |
| noted, you might want to just use a (self-referencing) vdelta compress |
| instead of a diff, if the result of the compression is smaller than the |
| diff. |
| |
| Hmm. It's been a long time since I wrote that, and as usual I left some |
| of the reasoning out. I'll have to think about this again. I sort of |
| remember it had to do with true random access to the text. |
| |
| >We're still thinking about this, but maybe you can put us out of our |
| >misery quickly. :-) |
| |
| Thanks, you just got me worrying about it. :-) |
| |
| >Also, did you mean |
| > |
| > WINDOW ::= (DIFF SIZE CHECKSUM [REP-KEY REP-OFFSET]) ; |
| > |
| >i.e., with parens, rather than without? Yes, it would work without |
| >being a sublist, but for maintainability a sublist might be |
| >preferable... |
| |
| I meant without params, but obviously it doesn't hurt to make a sublist |
| out of it. Use whatever you find more aesthetically pleasing. :-) |
| |
| >Anyway, we can start coding right away, while awaiting clarification. |
| >Found no holes in the proposal; agree that there is a slight storage |
| >penalty, but the memory usage and speed gains are so overwhelming that |
| >it would be petty to complain about the *very* gently-sloped, albeit |
| >linear, increase in storage per deltified file. |
| |
| Wonderful. Now I /really/ have to dust off and finish the delta combiner. |
| |
| >The replacing of distant diffs with ones nearer the fulltext is a |
| >great idea; we'll probably wait on that until after the basic rewrite |
| >is done, however, as it is an optimization, though a very effective |
| >one. |
| |
| Yes, it's an optimization only. What's more, it can be done entirely |
| off-line. |
| |
| ========================================================================= |
| Commentary: |
| ========================================================================= |
| |
| We're going to just ignore REP-OFFSET for now, we can do everything |
| without it. Maybe it will be used in a true delta-combiner later. |
| |
| Also, yes, we'll wrap WINDOW in an extra pair of parens, purely for |
| aesthetic reasons. So: |
| |
| DELTA ::= (("delta" FLAG ...) (OFFSET WINDOW) ...) ; |
| WINDOW ::= (DIFF SIZE CHECKSUM [REP-KEY [REP-OFFSET]]) ; |
| OFFSET ::= number ; |
| REP-OFFSET ::= number; |