blob: b2d16cdd8a7286b96a695676987a591310480af3 [file] [log] [blame]
@node Deltas
@chapter Deltas
Subversion uses three kinds of deltas:
@itemize @bullet
@item
A @b{@dfn{tree delta}} describes the difference between two arbitrary
directory trees, the way a traditional patch describes the difference
between two files. For example, the delta between directories A and B
could be applied to A, to produce B.
Tree deltas can also carry ancestry information, indicating how the
files in one tree are related to files in the other tree. And deltas
can describe changes to file meta-information, like permission bits,
creation dates, and so on. The repository and working copy use deltas
to communicate changes.
@item
A @b{@dfn{text delta}} describes changes to a string of bytes, such as the
contents of a file. It is analogous to traditional patch format, except
that it works equally well on binary and text files, and is not
invertible (because context and deleted data are not recorded).
@item
A @b{@dfn{property delta}} describes changes to a list of named
properties (@pxref{Properties}).
@end itemize
The term @dfn{delta} without qualification generally means a tree delta,
unless some other meaning is clear from context.
In the examples below, deltas will be described in XML, which happens to
be Subversion's import/export format. However, note that deltas are an
abstract data structure, of which the XML format is merely one
representation. Later, we will describe other representations: for
example, there is a serialized representation (useful for streaming
protocols, among other things), and a db-style representation, used for
repository storage. The various representations of a given delta are
(in theory, anyway) perfectly isomorphic to one another, since they
describe the same underlying structure.
@menu
* Text Deltas::
* Property Deltas::
* Tree Deltas::
* Postfix Text Deltas::
* Serializing Deltas via the "Editor" Interface::
@end menu
@c -----------------------------------------------------------------------
@node Text Deltas
@section Text Deltas
A text delta describes the difference between two strings of bytes, the
@dfn{source} string and the @dfn{target} string. Given a source string
and a target string, we can compute a text delta; given a source string
and a delta, we can reconstruct the target string. However, note that
deltas are not invertible: you cannot always reconstruct the source
string given the target string and delta.
The standard Unix ``diff'' format is one possible representation for
text deltas; however, diffs are not ideal for internal use by a revision
control system, for several reasons:
@itemize @bullet
@item
Diffs are line-oriented, which makes them human-readable, but sometimes
makes them perform poorly on binary files.
@item
Diffs represent a series of replacements, exchanging selected ranges of
the old text with new text; again, this is easy for humans to read, but
it is more expensive to compute and less compact than some alternatives.
@end itemize
Instead, Subversion uses the VDelta binary-diffing algorithm, as
described in @cite{Hunt, J. J., Vo, K.-P., and Tichy, W. F. An
empirical study of delta algorithms. Lecture Notes in Computer Science
1167 (July 1996), 49-66.} Currently, the output of this algorithm is
stored in a custom data format called @dfn{svndiff}, invented by Greg
Hudson <@email{ghudson@@mit.edu}>, a Subversion developer.
The concrete form of a text delta is a well-formed XML element, having
the following form:
@example
<text-delta>@var{data}</text-delta>
@end example
Here, @var{data} is the raw svndiff data, encoded in the MIME Base64
format.
@c -----------------------------------------------------------------------
@node Property Deltas
@section Property Deltas
A property delta describes changes to a property list, of the sort
associated with files, directories, and directory entries, and revision
numbers (@pxref{Properties}). A property delta can record creating,
deleting, and changing the text of any number of properties.
A property delta is an unordered set of name/change pairs. No two
pairs within a given property delta have the same name. A pair's name
indicates the property affected, and the change indicates what happens
to its value. There are two kinds of changes:
@table @code
@item set @var{value}
Change the value of the named property to the byte string @var{value}.
If there is no property with the given name, one is added to the
property list.
@item delete
Remove the named property from the property list.
@end table
At the moment, the @code{set} command can either create or change a
property value. However, this simplification means that the server
cannot distinguish between a client which believes it is creating a
value afresh, and a client which believes it is changing the value of an
existing property. It may simplify conflict detection to divide
@code{set} into two separate @code{add} and @code{change} operations.
In the future, we may add a @code{text-delta} change, which specifies a
change to an existing property's value as a text delta. This would give
us a compact way to describe small changes to large property values.
The concrete form of a property delta is a well-formed XML element,
having the following form:
@example
<property-delta>@var{change}@dots{}</property-delta>
@end example
Each @var{change} in a property delta has one of the following forms:
@example
<set name='@var{name}'>@var{value}</set>
<delete name='@var{name}'/>
@end example
The @var{name} attribute of a @code{set} or @code{delete} element gives
the name of the property to change. The @var{value} of a @code{set}
element gives the new value of the property.
If either the property name or the property value contains the
characters @samp{&}, @samp{<}, or @samp{'}, they should be replaced with
the sequences @samp{&#38}, @samp{&#60}, or @samp{&#39}, respectively.
@c -----------------------------------------------------------------------
@node Tree Deltas
@section Tree Deltas
A tree delta describes changes between two directory trees, the
@dfn{source tree} and the @dfn{target tree}. Tree deltas can describe
copies, renames, and deletions of files and directories, changes to file
contents, and changes to property lists. A tree delta can also carry
information about how the files in the target tree are derived from the
files in the source tree, if this information is available.
The format for tree deltas described here is easy to compute from a
Subversion working directory, and easy to apply to a Subversion
repository. Furthermore, the size of a tree delta in this format is
independent of the commands used to produce the target tree --- it
depends only on the degree of difference between the source and target
trees.
A tree delta is interpreted in the context of three parameters:
@itemize @bullet
@item
@var{source-root}, the name of the directory to which this complete
tree delta applies,
@item
@var{revision}, indicating a particular revision of @dots{}
@item
@var{source-dir}, which is a directory in the source tree that we are
currently modifying to yield @dots{}
@item
@dots{} @dfn{target-dir} --- the directory we're constructing.
@end itemize
When we start interpreting a tree delta, @var{source-root},
@var{source-dir}, and @var{target-dir} are all equal. As we walk the
tree delta, @var{target-dir} walks the tree we are constructing, and
@var{source-dir} walks the corresponding portion of the source tree,
which we use as the original. @var{Source-root} remains constant as we
walk the delta; we may use it to choose new source trees.
A tree delta is a list of changes of the form
@example
<tree-delta>@var{change}@dots{}</tree-delta>
@end example
which describe how to edit the contents of @var{source-dir} to yield
@var{target-dir}. There are three kinds of changes:
@table @code
@item <delete name='@var{name}'/>
@var{Source-dir} has an entry named @var{name}, which is not present in
@var{target-dir}.
@item <add name='@var{name}'>@var{content}</add>
@var{target-dir} has an entry named @var{name}, which is not present in
@var{source-dir}; @var{content} describes the file or directory to which
the new directory entry refers.
@item <replace name='@var{name}'>@var{content}</replace>
Both @var{source-dir} and @var{target-dir} have an entry named
@var{name}, which has changed; @var{content} describes the new file or
directory.
@end table
Any entries in @var{source-dir} whose names aren't mentioned are assumed
to appear unchanged in @var{target-dir}. Thus, an empty
@code{tree-delta} element indicates that @var{target-dir} is identical
to @var{source-dir}.
In the change descriptions above, each @var{content} takes one of the
following forms:
@table @code
@item <file @var{ancestor}>@var{prop-delta} @var{text-delta}</file>
The given @var{target-dir} entry refers to a file, @var{f}.
@var{Ancestor} indicates which file in the source tree @var{f} is
derived from, if any.
@var{Prop-delta} is a property delta describing how @var{f}'s properties
differ from that ancestor; it may be omitted, indicating that the
properties are unchanged.
@var{Text-delta} is a text delta describing how to construct @var{f}
from that ancestor; it may also be omitted, indicating that @var{f}'s
text is identical to its ancestor's.
@item <file @var{ancestor}/>
An abbreviation for @code{<file @var{ancestor}></file>} --- a file
element with no property or text delta, thus describing a file identical
to its ancestor.
@item <directory @var{ancestor}>@var{prop-delta} @var{tree-delta}</directory>
The given @var{target-dir} entry refers to a subdirectory, @var{sub}.
@var{Ancestor} indicates which directory in the source tree @var{sub} is
derived from, if any.
@var{Prop-delta} is a property delta describing how @var{sub}'s
properties differ from that ancestor; it may be omitted, indicating that
the properties are unchanged.
@var{Tree-delta} describes how to construct @var{sub} from that
ancestor; it may be omitted, indicating that the directory is identical
to its ancestor. @var{Tree-delta} should be interpreted with a new
@var{target-dir} of @file{@var{target-dir}/@var{name}}.
Since @var{tree-delta} is itself a complete tree delta structure, tree
deltas are themselves trees, whose structure is a subgraph of the target
tree.
@item <directory @var{ancestor}/>
An abbreviation for @code{<directory @var{ancestor}></directory>} --- a
directory element with no property or tree delta, thus describing a
directory identical to its ancestor.
@end table
The @var{content} of a @code{add} or @code{replace} tag may also contain
a property delta, describing changes to the properties of that
@emph{directory entry}.
In the @code{file} and @code{directory} elements described above, each
@var{ancestor} has one of the following forms:
@table @code
@item ancestor='@var{path}'
The ancestor of the new or changed file or directory is
@file{@var{source-root}/@var{path}}, in @var{revision}. When this
appears as an attribute of a @code{file} element, the element's text
delta should be applied to @file{@var{source-root}/@var{path}}. When
this appears as an attribute of a @code{directory} element,
@file{@var{source-root}/@var{path}} should be the new @var{source-dir}
for interpreting that element's tree delta.
@item new='true'
This indicates that the file or directory has no ancestor in the source
tree. When followed by a @var{text-delta}, that delta should be applied
to the empty file to yield the new text; when followed by a
@var{tree-delta}, that delta should be evaluated as if @var{source-dir}
were an imaginary empty directory.
@item @var{nothing}
If neither an @code{ancestor} nor a @code{new} attribute is given, this
is an abbreviation for @code{ancestor='@var{source-dir}/@var{name}'},
with the same revision number. This makes the common case --- files or
directories modified in place --- more compact.
@end table
If the @var{ancestor} spec is not @code{new='true'}, it may also contain
the text @code{revision='@var{rev}'}, indicating a new value for
@var{revision}, in which we should find the ancestor.
If a filename or path appearing as a @var{name} or @var{path} in the
description above contains the characters @samp{&}, @samp{<}, or
@samp{'}, they should be replaced with the sequences @samp{&#38;},
@samp{&#60;}, or @samp{&#39;}, respectively.
Suppose we have the following source tree:
@example
/dir1/file1
file2
dir2/file3
file4
dir3/file5
file6
@end example
If we edit the contents of @file{/dir1/file1}, we can describe the
effect on the tree with the following tree delta, to be applied to the
root:
@example
<tree-delta>
<replace name='dir1'>
<directory>
<tree-delta>
<replace name='file1'>
<file>@var{text-delta}</file>
</replace>
</tree-delta>
</directory>
</replace>
</tree-delta>
@end example
The outer @code{tree-delta} element describes the changes made to the root
directory. Within the root directory, there are changes in @file{dir1},
described by the nested @code{tree-delta}. Within @file{/dir1}, there are
changes in @file{file1}, described by the @var{text-delta}.
If we had edited both @file{/dir1/file1} and @file{/dir1/file2}, then
there would simply be two @code{replace} elements in the inner
@code{tree-delta}.
As another example, starting from the same source tree, suppose we
rename @file{/dir1/file1} to @file{/dir1/file8}:
@example
<tree-delta>
<replace name='dir1'>
<directory>
<tree-delta>
<delete name='file1'/>
<add name='file8'>
<file ancestor='/dir1/file1'/>
</add>
</tree-delta>
</directory>
</replace>
</tree-delta>
@end example
As above, the inner @code{tdelta} describes how @file{/dir1} has
changed: the entry for @file{/dir1/file1} has disappeared, but there is
a new entry, @file{/dir1/file8}, which is derived from and textually
identical to @file{/dir1/file1} in the source directory. This is just
an indirect way of describing the rename.
Why is it necessary to be so indirect? Consider the delta representing
the result of:
@enumerate
@item
renaming @file{/dir1/file1} to @file{/dir1/tmp},
@item
renaming @file{/dir1/file2} to @file{/dir1/file1}, and
@item
renaming @file{/dir1/tmp} to @file{/dir1/file2}
@end enumerate
(in other words, exchanging @file{file1} and @file{file2}):
@example
<tree-delta>
<replace name='dir1'>
<directory>
<tree-delta>
<replace name='file1'>
<file ancestor='/dir1/file2'/>
</replace>
<replace name='file2'>
<file ancestor='/dir1/file1'/>
</replace>
</tree-delta>
</directory>
</replace>
</tree-delta>
@end example
The indirectness allows the tree delta to capture an arbitrary
rearrangement without resorting to temporary filenames.
Another example, starting from the same source tree:
@enumerate
@item
rename @file{/dir1/dir2} to @file{/dir1/dir4},
@item
rename @file{/dir1/dir3} to @file{/dir1/dir2}, and
@item
move @file{file3} from @var{/dir1/dir4} to @var{/dir1/dir2}.
@end enumerate
Note that @file{file3}'s path has remained the same, even though the
directories around it have changed. Here is the tree delta:
@example
<tree-delta>
<replace name='dir1'>
<directory>
<tree-delta>
<replace name='dir2'>
<directory ancestor='/dir1/dir3'>
<tree-delta>
<add name='file3'>
<file ancestor='/dir1/dir2/file3'/>
</add>
</tree-delta>
</directory>
</replace>
<delete name='dir3'/>
<add name='dir4'>
<directory ancestor='/dir1/dir2'>
<tree-delta>
<delete name='file3'/>
</tree-delta>
</directory>
</add>
</tree-delta>
</directory>
</replace>
</tree-delta>
@end example
In other words:
@itemize @bullet
@item
@file{/dir1} has changed;
@item
the new directory @file{/dir1/dir2} is derived from the old
@file{/dir1/dir3}, and contains a new entry @file{file3}, derived from
the old @file{/dir1/dir2/file3};
@item
there is no longer any @file{/dir1/dir3}; and
@item
the new directory @file{/dir1/dir4} is derived from the old
@file{/dir1/dir2}, except that its entry for @file{file3} is now gone.
@end itemize
Some more possible maneuvers, left as exercises for the reader:
@itemize @bullet
@item
Delete @file{dir2}, and then create a file named @file{dir2}.
@item
Rename @file{/dir1/dir2} to @file{/dir1/dir4}; move @file{file2} into
@file{/dir1/dir4}; and move @file{file3} into @var{/dir1/dir3}.
@item
Move @file{dir2} into @file{dir3}, and move @file{dir3} into @file{/}.
@end itemize
@c ----------------------------------------------------------------------
@node Postfix Text Deltas
@section Postfix Text Deltas
It is sometimes useful to represent a set of changes to a tree without
providing text deltas in the middle of the stream. Text deltas are
often large and expensive to compute, and tree deltas can be useful
without them. For example, one can detect whether two changes might
conflict --- whether they change the same file, for example --- without
knowing exactly how the conflicting files changed.
For this reason, our XML representation of a tree delta allows the text
deltas to come @emph{after} the </tree-delta> closure. This allows the
client to receive early notice of conflicts: during a @code{svn commit}
command, the client sends a tree-delta to the server, which can check
for skeletal conflicts and reject the commit, before the client takes the
time to transmit the (possibly large) textual changes. This potentially
saves quite a bit of network traffic.
In terms of XML, postfix text deltas are split into two parts. The
first part appears "in-line" and contains a reference ID. The second
part appears after the tree delta is complete. Here's an example:
@example
<tree-delta>
<replace name="foo.c">
<file>
<text-delta-ref id="123">
</file>
</replace>
<add name="bar.c">
<file>
<text-delta-ref id="456">
</file>
</add>
</tree-delta>
<text-delta id="123">@emph{data}</text-delta>
<text-delta id="456">@emph{data}</text-delta>
@end example
@c ----------------------------------------------------------------------
@node Serializing Deltas via the "Editor" Interface
@section Serializing Deltas via the "Editor" Interface
The static XML forms above are useful as an import/export format, and as
a visualization aid, but we also need a way to express a delta as a
@emph{series of operations}, to implement directory tree diffing and
patching. Subversion defines a standard set of such operations in the
vtable @code{svn_delta_edit_fns_t}, a set of function prototypes which
anyone may implement (see @file{svn_delta.h}).
Each function in an instance of @code{svn_delta_edit_fns_t}
(colloquially known as an @dfn{editor}) implements some distinct subtask
of editing a directory tree. In fact, if you compare the editor
function prototypes to the XML elements described previously, you'll
notice a fairly strict correspondence: there's one function for
replacing a directory, another function for replacing a file, one for
adding a directory, another for adding a file, a function for deleting,
and so on.
Although the editor interface was designed around the general idea of
making changes to a directory tree, a specific implementation's behavior
depends on its role. For example, the versioning filesystem library
offers an editor that creates new revisions, while the working copy
library offers an editor that updates working copies. And the network
layer offers an editor that turns editing calls into wire protocol,
which is then converted back into editing calls on the other side! All
of these different tasks can share a single interface, because they are
all fundamentally about the same thing: expressing and applying
differences between directory trees.
Like the XML forms, a series of editor calls must follow certain nesting
conventions; these conventions are implicit in the interface, in that
some of the functions take arguments that can only be obtained from
previous calls to other editor functions.
Editors can best be understood by watching one work on a real directory
tree. For example:
@c kff todo: fooo working here.
Suppose that the user has made a number of local changes to her working
copy and wants to commit them to the repository. Let's represent her
changes with the same tree-delta from a previous example. Notice that
she has also made textual modifications to @file{file3}; hence the
in-line @code{<text-delta>}:
@example
<tree-delta>
<open name='dir1'>
<directory>
<tree-delta>
<open name='dir2'>
<directory ancestor='/dir1/dir3'>
<tree-delta>
<add name='file3'>
<file ancestor='/dir1/dir2/file3'>
<text-delta>@emph{data}</text-delta>
</file>
</add>
</tree-delta>
</directory>
</open>
<delete name='dir3'/>
<add name='dir4'>
<directory ancestor='/dir1/dir2'>
<tree-delta>
<delete name='file3'/>
</tree-delta>
</directory>
</add>
</tree-delta>
</directory>
</open>
</tree-delta>
@end example
So how does the client send this information to the server?
In a nutshell: the tree-delta is @emph{streamed} over the network, as a
series of individual commands given in depth-first order.
Let's be more specific. The server presents the client with an object
of type @code{struct svn_delta_edit_fns_t}, colloquially known as an
@dfn{editor}. An editor is really just table of functions; each
function makes a change to a filesystem. Agent A (who has a private
filesystem) presents an editor to agent B. Agent B then calls the
editor's functions to change A's filesystem. B is said to be
@dfn{driving} the editor.
As Karl Fogel likes to describe the process, if one thinks of the
tree-delta as a lion, the editor is a "hoop" that the lion jumps through
-- each portion of the lion being decomposed through time.
B cannot call the functions in any willy-nilly order; there are some
logical restrictions. In particular, as B drives the editor, it
receives opaque data structures which represent directories and files.
It must use and pass these structures, known as @dfn{batons}, to make
further function calls.
As an example, let's watch how the client would transmit the above
tree-delta to the repository. (The description below is slightly
simplified. For exact interface details, see
@file{subversion/include/svn_delta.h}.)
@enumerate
@item
The repository hands an "editor" to the client.
@item
The client begins by calling
@code{root_baton = editor->open_root();}
The client now has an opaque object, @dfn{root_baton}, which represents
the root of the repository's filesystem.
@item
@code{dir1_baton = editor->open_dir("dir1", root_baton);}
Notice that @emph{root_baton} gives the client free license to make any
changes it wants in the repository's root directory -- until, of course,
it calls @code{editor->close_dir(root_baton)}.
The first change made was a replacement of @file{dir1}. In return, the
client now has a new opaque data structure that can be used to change
@file{dir1}.
@item
@code{dir2_baton = editor->open_dir("dir2", "/dir1/dir3", dir1_baton);}
The @emph{dir1_baton} is now used to replace @file{dir2} with a
directory whose ancestor is @file{/dir1/dir3}.
@item
@code{file_baton = editor->add_file("file3", "/dir1/dir2/file3", dir2_baton);}
Edits are now made to @file{dir2} (using @emph{dir2_baton}). In
particular, a new file is added to this directory whose ancestor is
@file{/dir1/dir2/file3}.
@item
Now the text-delta associated with @emph{file_baton} needs to be
transmitted:
@code{window_handler = editor->apply_textdelta(file_baton);}
Text-deltas themselves, for network efficiency, are streamed in
"chunks". So instead of receiving a baton object, we now have a routine
that is able to receive any number of small "windows" of text-delta
data.
We won't go into the details of the @code{svn_txdelta_*} functions right
here; but suffice it to say that these routines are used for sending
svndiff data to the @emph{window_handler} routine.
@item
@code{editor->close_file(file_baton);}
The client is done sending the file's text-delta, so it releases the
file baton.
@item
@code{editor->close_dir(dir2_baton));}
The client is done making changes to @file{dir2}, so it releases its
baton as well.
@item
The client isn't yet finished with @file{dir1}, however; it makes two
more edits:
@code{editor->delete_item("dir3", dir1_baton);} @*
@code{dir4_baton = editor->add_dir("dir4", "/dir1/dir2", dir1_baton);}
@emph{(The function's name is @code{delete_item} rather than
@code{delete} to avoid gratuitous incompatibility with C++, where
@code{delete} is a reserved keyword.)}
@item
Within the directory @file{dir4} (whose ancestry is @file{/dir1/dir2}),
the client removes a file:
@code{editor->delete_item("file3", dir4_baton);}
@item
The client is now finished with both @file{dir4}, as well as its parent
@file{dir1}:
@code{editor->close_dir(dir4_baton);} @*
@code{editor->close_dir(dir1_baton);}
@item
The entire tree-delta is complete. The repository knows this when the
root directory is closed:
@code{editor->close_dir(root_baton);}
@end enumerate
Of course, at any point above, the repository may reject an edit. If
this is the case, the client aborts the transmission and the repository
hasn't changed a bit. (Thank goodness for transactions!)
Note, however, that this "editor interface" works in the other direction
as well. When the repository wishes to update a client's working copy,
it is the @emph{client's} reponsibility to give a custom editor-object
to the server, and the @emph{server} is the editor-driver.
Here are the main advantages of this interface:
@itemize @bullet
@item
@emph{Consistency}. Tree-deltas move across the network, in both
directions, using the same interface.
@item
@emph{Flexibility}. Custom editor-implementations can be written to do
anything one might want; the editor-driver has no idea what is
happening on the other side of the interface. For example, an editor
might
@itemize @bullet
@item
Output XML that matches the tree-delta DTD above;
@item
Output human-readable descriptions of the edits taking place;
@item
Modify a filesystem
@end itemize
@end itemize
Whatever the case, it's easy to "swap" editors around, and make client
and server do new and interesting things.