notes/delta-indexing-and-composition.txt - subversion - Git at Google

                          Deltified Storage
                          -----------------

 Mike and I are reworking the deltification/undeltification code right
 now, to be a lot more efficient than the first implementation.  We're
 using Branko's proposed scheme; below is his description of it, plus
 some related correspondence.

 =========================================================================
    Branko's original mail, edited for space but no content changed:
 =========================================================================

    Date: Fri, 27 Jul 2001 06:11:04 +0200
    From: Branko =?ISO-8859-2?Q?=C8ibej?= <brane@xbc.nu>
    Content-Type: text/plain; charset=ISO-8859-2; format=flowed
    Subject: RFC: Delta indexing and composition

    Here's the "more on that later" I promised.

    Now that I've safely postponed the delta combiner, I'd like to share my
    ideas for improving the deltification code in the filesystem.

    The scheme we have in place now works (kudos to Mike and Karl here!),
    but wastes both time and space, and  is probably a mild disaster for
    non-sequential access to large files or old versions. The reasons for
    this are obvious:

       1. It reads/resurrects the whole text from the beginning up to the
          end of the interesting region
       2. It keeps applying all the deltas from the version to the fulltext
          every time an old version is accessed, throwing away the result.

    My suggested solution to these two problems is based on recognizing
    exactly what a delta window is:

        A delta window defines a contiguous region of text A. It may depend
        on a contiguous region of text B, but is independent of any other
        delta window in the delta representation of A.

    Given this definition, two things become obvious:

       1. To reconstruct the region defined by window N, we only have to
          read that window.
       2. Different windows within a delta may depend on different source texts.

    The first item implies that we can solve the first problem my indexing
    the windows within a delta, and accessing them directly. But we knew
    that already.

    The second item implies that every time we reconstruct a region of text,
    we can replace its defining delta window with a single diff from the
    fulltext, eliminating the intermediate reconstruction steps the next
    time this region is accessed -- thus solving the second problem.

    So, here's my proposal

    1) Change the delta representation to index and store delta windows
    separately

            DELTA ::= (("delta" FLAG ...) (OFFSET WINDOW) ...) ;
           WINDOW ::= DIFF SIZE CHECKSUM [REP-KEY REP-OFFSET] ;
           OFFSET ::= number ;
       REP-OFFSET ::= number;


    The REP-KEY and REP-OFFSET in WINDOW are optional because, if the
    differences between two file revisions is large enough, the diff could
    in fact be larger than a compression-only vdelta of the text region. In
    that case it makes more sense to compress the window than to store a diff.

    2) Change the undeltifier to use the new structure

    The undeltifier will stay essentially the same as it is now, except that
    it will use OFFSET and REP-OFFSET to access the necessary bits directly.
    The place where the delta combiner will fit stays the same, too.

    The major addition comes after the text is reconstructed. Using some
    suitable heuristic -- probably based on the number of jumps from the
    representation to the fulltext, the size of the diff from the fulltext,
    etc. -- we can decide to: a) replace the window with a single diff from
    the fulltext, b) replace it with a compressed version of the region, or
    c) do nothing.

    The disadvantage of this proposal is, of course, more space used in the
    repository. We can reduce the increase somewhat by compressing the
    window index, and possibly improving the svndiff encoding. But I think
    it's a fair price to pay, because this scheme reduces the number of disk
    accesses, total memory use and average processing time needed to
    reconstruct a region of text.

    That's it. Let's see you find holes in my proposal. :-)

        Brane

 =========================================================================
    Then Mike and I asked Branko some questions, here is his response:
 =========================================================================

    From: Branko =?ISO-8859-2?Q?=C8ibej?= <brane@xbc.nu>
    Subject: Re: deltification semi-rewrite starting now
    To: kfogel@collab.net
    CC: dev@subversion.tigris.org
    Date: Mon, 24 Sep 2001 23:45:48 +0200

    kfogel@collab.net wrote:
    >Mike Pilato and I have just reviewed Branko's deltification proposal,
    >found at
    >
    >   notes/delta-indexing-and-composition.txt
    >
    >and like what we see :-).  We have a couple of questions that probably
    >Branko can answer quickly, but basically we're going to start
    >implementing it now, completion anticipated in 2 weeks max (thank
    >goodness all the strings/reps separation is already done, so that
    >whole wheel doesn't need to be reinvented).
    >
    >The plan is that we'll also implement a new `svnadmin' subcommand for
    >deltifying and undeltifying revisions, or particular paths within
    >revisions.  That way, administrators have a way to make certain trees
    >very efficient to retrieve -- for example, one might want to do this
    >to a tagged release -- and also gives us an obvious way to deltify the
    >storage of the current svn repository without perturbing the revision
    >numbers. :-)
    >
    >Branko, a couple of questions regarding your lovely design:
    >
    >>So, here's my proposal
    >>
    >>1) Change the delta representation to index and store delta windows
    >>separately
    >>
    >>        DELTA ::= (("delta" FLAG ...) (OFFSET WINDOW) ...) ;
    >>       WINDOW ::= DIFF SIZE CHECKSUM [REP-KEY REP-OFFSET] ;
    >>       OFFSET ::= number ;
    >>   REP-OFFSET ::= number;
    >>
    >>
    >>The REP-KEY and REP-OFFSET in WINDOW are optional because, if the
    >>differences between two file revisions is large enough, the diff could
    >>in fact be larger than a compression-only vdelta of the text region. In
    >>that case it makes more sense to compress the window than to store a diff.
    >>
    >
    >We're not sure what REP-OFFSET is for.
    >
    >We're pretty sure we understand OFFSET.  It's the offset into the
    >reconstructed fulltext.  The OFFSETs increase with each WINDOW in a
    >DELTA, and you can tell a given window's reconstruction range either
    >by adding OFFSET + SIZE, or by subtracting one OFFSET from the next.
    >
    >Hopefully that's a correct summary. :-)

    Yes, that is exactly right.

    >But what is REP-OFFSET?  We understand the REP-KEY that precedes it.
    >That's simply the representation against whose fulltext this delta
    >applies, right?

    Let me think ... Yes.

    >  But why would we want an offset into that rep?  We
    >had thought the relevant offset(s) are part of the svndiff encoding.
    >Is it a way of magically jumping over a certain number of windows and
    >landing on the right one, in next-most-immediate source
    >representation, or is it something else?

    Although the offset is implicit in the svndiff, in real life you want to
    find the source (fulltext) *before* decoding the window. Also, as I
    noted, you might want to just use a (self-referencing) vdelta compress
    instead of a diff, if the result of the compression is smaller than the
    diff.

    Hmm. It's been a long time since I wrote that, and as usual I left some
    of the reasoning out. I'll have to think about this again. I sort of
    remember it had to do with true random access to the text.

    >We're still thinking about this, but maybe you can put us out of our
    >misery quickly. :-)

    Thanks, you just got me worrying about it. :-)

    >Also, did you mean
    >
    >   WINDOW ::= (DIFF SIZE CHECKSUM [REP-KEY REP-OFFSET]) ;
    >
    >i.e., with parens, rather than without?  Yes, it would work without
    >being a sublist, but for maintainability a sublist might be
    >preferable...

    I meant without params, but obviously it doesn't hurt to make a sublist
    out of it. Use whatever you find more aesthetically pleasing. :-)

    >Anyway, we can start coding right away, while awaiting clarification.
    >Found no holes in the proposal; agree that there is a slight storage
    >penalty, but the memory usage and speed gains are so overwhelming that
    >it would be petty to complain about the *very* gently-sloped, albeit
    >linear, increase in storage per deltified file.

    Wonderful. Now I /really/ have to dust off and finish the delta combiner.

    >The replacing of distant diffs with ones nearer the fulltext is a
    >great idea; we'll probably wait on that until after the basic rewrite
    >is done, however, as it is an optimization, though a very effective
    >one.

    Yes, it's an optimization only. What's more, it can be done entirely
    off-line.

 =========================================================================
                           Commentary:
 =========================================================================

    We're going to just ignore REP-OFFSET for now, we can do everything
    without it.  Maybe it will be used in a true delta-combiner later.

    Also, yes, we'll wrap WINDOW in an extra pair of parens, purely for
    aesthetic reasons.  So:

            DELTA ::= (("delta" FLAG ...) (OFFSET WINDOW) ...) ;
           WINDOW ::= (DIFF SIZE CHECKSUM [REP-KEY [REP-OFFSET]]) ;
           OFFSET ::= number ;
       REP-OFFSET ::= number;
	Deltified Storage
	-----------------

	Mike and I are reworking the deltification/undeltification code right
	now, to be a lot more efficient than the first implementation. We're
	using Branko's proposed scheme; below is his description of it, plus
	some related correspondence.

	=========================================================================
	Branko's original mail, edited for space but no content changed:
	=========================================================================

	Date: Fri, 27 Jul 2001 06:11:04 +0200
	From: Branko =?ISO-8859-2?Q?=C8ibej?= <brane@xbc.nu>
	Content-Type: text/plain; charset=ISO-8859-2; format=flowed
	Subject: RFC: Delta indexing and composition

	Here's the "more on that later" I promised.

	Now that I've safely postponed the delta combiner, I'd like to share my
	ideas for improving the deltification code in the filesystem.

	The scheme we have in place now works (kudos to Mike and Karl here!),
	but wastes both time and space, and is probably a mild disaster for
	non-sequential access to large files or old versions. The reasons for
	this are obvious:

	1. It reads/resurrects the whole text from the beginning up to the
	end of the interesting region
	2. It keeps applying all the deltas from the version to the fulltext
	every time an old version is accessed, throwing away the result.

	My suggested solution to these two problems is based on recognizing
	exactly what a delta window is:

	A delta window defines a contiguous region of text A. It may depend
	on a contiguous region of text B, but is independent of any other
	delta window in the delta representation of A.

	Given this definition, two things become obvious:

	1. To reconstruct the region defined by window N, we only have to
	read that window.
	2. Different windows within a delta may depend on different source texts.

	The first item implies that we can solve the first problem my indexing
	the windows within a delta, and accessing them directly. But we knew
	that already.

	The second item implies that every time we reconstruct a region of text,
	we can replace its defining delta window with a single diff from the
	fulltext, eliminating the intermediate reconstruction steps the next
	time this region is accessed -- thus solving the second problem.

	So, here's my proposal

	1) Change the delta representation to index and store delta windows
	separately

	DELTA ::= (("delta" FLAG ...) (OFFSET WINDOW) ...) ;
	WINDOW ::= DIFF SIZE CHECKSUM [REP-KEY REP-OFFSET] ;
	OFFSET ::= number ;
	REP-OFFSET ::= number;


	The REP-KEY and REP-OFFSET in WINDOW are optional because, if the
	differences between two file revisions is large enough, the diff could
	in fact be larger than a compression-only vdelta of the text region. In
	that case it makes more sense to compress the window than to store a diff.

	2) Change the undeltifier to use the new structure

	The undeltifier will stay essentially the same as it is now, except that
	it will use OFFSET and REP-OFFSET to access the necessary bits directly.
	The place where the delta combiner will fit stays the same, too.

	The major addition comes after the text is reconstructed. Using some
	suitable heuristic -- probably based on the number of jumps from the
	representation to the fulltext, the size of the diff from the fulltext,
	etc. -- we can decide to: a) replace the window with a single diff from
	the fulltext, b) replace it with a compressed version of the region, or
	c) do nothing.

	The disadvantage of this proposal is, of course, more space used in the
	repository. We can reduce the increase somewhat by compressing the
	window index, and possibly improving the svndiff encoding. But I think
	it's a fair price to pay, because this scheme reduces the number of disk
	accesses, total memory use and average processing time needed to
	reconstruct a region of text.

	That's it. Let's see you find holes in my proposal. :-)

	Brane

	=========================================================================
	Then Mike and I asked Branko some questions, here is his response:
	=========================================================================

	From: Branko =?ISO-8859-2?Q?=C8ibej?= <brane@xbc.nu>
	Subject: Re: deltification semi-rewrite starting now
	To: kfogel@collab.net
	CC: dev@subversion.tigris.org
	Date: Mon, 24 Sep 2001 23:45:48 +0200

	kfogel@collab.net wrote:
	>Mike Pilato and I have just reviewed Branko's deltification proposal,
	>found at
	>
	> notes/delta-indexing-and-composition.txt
	>
	>and like what we see :-). We have a couple of questions that probably
	>Branko can answer quickly, but basically we're going to start
	>implementing it now, completion anticipated in 2 weeks max (thank
	>goodness all the strings/reps separation is already done, so that
	>whole wheel doesn't need to be reinvented).
	>
	>The plan is that we'll also implement a new `svnadmin' subcommand for
	>deltifying and undeltifying revisions, or particular paths within
	>revisions. That way, administrators have a way to make certain trees
	>very efficient to retrieve -- for example, one might want to do this
	>to a tagged release -- and also gives us an obvious way to deltify the
	>storage of the current svn repository without perturbing the revision
	>numbers. :-)
	>
	>Branko, a couple of questions regarding your lovely design:
	>
	>>So, here's my proposal
	>>
	>>1) Change the delta representation to index and store delta windows
	>>separately
	>>
	>> DELTA ::= (("delta" FLAG ...) (OFFSET WINDOW) ...) ;
	>> WINDOW ::= DIFF SIZE CHECKSUM [REP-KEY REP-OFFSET] ;
	>> OFFSET ::= number ;
	>> REP-OFFSET ::= number;
	>>
	>>
	>>The REP-KEY and REP-OFFSET in WINDOW are optional because, if the
	>>differences between two file revisions is large enough, the diff could
	>>in fact be larger than a compression-only vdelta of the text region. In
	>>that case it makes more sense to compress the window than to store a diff.
	>>
	>
	>We're not sure what REP-OFFSET is for.
	>
	>We're pretty sure we understand OFFSET. It's the offset into the
	>reconstructed fulltext. The OFFSETs increase with each WINDOW in a
	>DELTA, and you can tell a given window's reconstruction range either
	>by adding OFFSET + SIZE, or by subtracting one OFFSET from the next.
	>
	>Hopefully that's a correct summary. :-)

	Yes, that is exactly right.

	>But what is REP-OFFSET? We understand the REP-KEY that precedes it.
	>That's simply the representation against whose fulltext this delta
	>applies, right?

	Let me think ... Yes.

	> But why would we want an offset into that rep? We
	>had thought the relevant offset(s) are part of the svndiff encoding.
	>Is it a way of magically jumping over a certain number of windows and
	>landing on the right one, in next-most-immediate source
	>representation, or is it something else?

	Although the offset is implicit in the svndiff, in real life you want to
	find the source (fulltext) before decoding the window. Also, as I
	noted, you might want to just use a (self-referencing) vdelta compress
	instead of a diff, if the result of the compression is smaller than the
	diff.

	Hmm. It's been a long time since I wrote that, and as usual I left some
	of the reasoning out. I'll have to think about this again. I sort of
	remember it had to do with true random access to the text.

	>We're still thinking about this, but maybe you can put us out of our
	>misery quickly. :-)

	Thanks, you just got me worrying about it. :-)

	>Also, did you mean
	>
	> WINDOW ::= (DIFF SIZE CHECKSUM [REP-KEY REP-OFFSET]) ;
	>
	>i.e., with parens, rather than without? Yes, it would work without
	>being a sublist, but for maintainability a sublist might be
	>preferable...

	I meant without params, but obviously it doesn't hurt to make a sublist
	out of it. Use whatever you find more aesthetically pleasing. :-)

	>Anyway, we can start coding right away, while awaiting clarification.
	>Found no holes in the proposal; agree that there is a slight storage
	>penalty, but the memory usage and speed gains are so overwhelming that
	>it would be petty to complain about the very gently-sloped, albeit
	>linear, increase in storage per deltified file.

	Wonderful. Now I /really/ have to dust off and finish the delta combiner.

	>The replacing of distant diffs with ones nearer the fulltext is a
	>great idea; we'll probably wait on that until after the basic rewrite
	>is done, however, as it is an optimization, though a very effective
	>one.

	Yes, it's an optimization only. What's more, it can be done entirely
	off-line.

	=========================================================================
	Commentary:
	=========================================================================

	We're going to just ignore REP-OFFSET for now, we can do everything
	without it. Maybe it will be used in a true delta-combiner later.

	Also, yes, we'll wrap WINDOW in an extra pair of parens, purely for
	aesthetic reasons. So:

	DELTA ::= (("delta" FLAG ...) (OFFSET WINDOW) ...) ;
	WINDOW ::= (DIFF SIZE CHECKSUM [REP-KEY [REP-OFFSET]]) ;
	OFFSET ::= number ;
	REP-OFFSET ::= number;