= How to interpret Subversion dumpfiles =
Version 1.1, 2013-02-02
== Introduction ==
The Subversion dumpfile format is a serialized description of the
actions required to (re)build a version history from scratch.
The goal of this document is that it be sufficient for people writing
dumpfile interpreters to emulate the actions the dumpfile describes on
a versioned filesystem-like store, such as another version-control
system. It derives from and incorporates some incomplete notes from
before r39883.
=== Unresolved questions ===
1. In interpreting a Node record which has both a copyfrom source and
a property section, it is possible that the copy source node itself
has a property section. How are they to be combined?
2. The section on the semantics of kinds of operations documents a
minor bug at r39883 in the behavior of "add". Has this been fixed?
Portions of text relevant to these questions are tagged with FIXME.
== Syntax ==
=== Encoding and delimiters ===
Subversion dumpfiles are plain byte streams. The structural parts are
ASCII. Text sections and property key/value pairs may be interpreted
as binary data in any encoding by client tools.
A dumpfile consists of four kinds of records. A record is a group of
RFC822-style header lines (each consisting of a key, followed by a
colon, followed by text data to end of line), followed by an empty
spacer line, followed optionally by a body section. If the body
section is present, another empty spacer line separates it from the
following record.
For forward compatibility, unrecognized headers are ignored.
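The header layout described above can be read with very little code. The following is a minimal Python sketch, not taken from Subversion itself; the function name `read_headers` is hypothetical, and it assumes the whole dump stream is already in memory as bytes and that header values are UTF-8.

```python
def read_headers(data: bytes, pos: int = 0):
    """Read header lines starting at pos.

    Returns (headers, pos_after_spacer). Assumption: values decode as UTF-8.
    """
    headers = {}
    while True:
        end = data.index(b"\n", pos)
        line = data[pos:end]
        pos = end + 1
        if not line:
            # The empty spacer line ends the header block.
            return headers, pos
        key, _, value = line.partition(b":")
        if value.startswith(b" "):
            value = value[1:]  # drop only the single separator space
        headers[key.decode("ascii")] = value.decode("utf-8")
```

A caller gets forward compatibility for free by looking up only the keys it understands and leaving the rest of the dictionary alone.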
=== Record types ===
Dumpfiles include four record types. Two, the version stamp and UUID
record, consist of single header lines. The bulk of a dumpfile
consists of Revision and Node records.
==== Version stamp records ====
A version stamp record is always the first line of the file and
looks like this:
-------------------------------------------------------------------
SVN-fs-dump-format-version: <N>\n
-------------------------------------------------------------------
where <N> is replaced by the dump format version. Except where
specified, the descriptions in this document apply to all
versions of the format.
==== UUID records ====
Versions 2 and later may have a UUID record following the version
stamp. It is of the form
-------------------------------------------------------------------
UUID: <hex-string>
-------------------------------------------------------------------
where the <hex-string> is the UUID of the originating repository.
An example UUID is "7bf7a5ef-cabf-0310-b7d4-93df341afa7e".
==== Revision records ====
A Revision record has three headers and is usually followed by a
property section. Expect the following form and sequence:
-------------------------------------------------------------------
Revision-number: <N>
[Prop-content-length: <P>]
Content-length: <L>
!
-------------------------------------------------------------------
with the Revision-number header always first and the '!' indicating
a mandatory empty spacer line. <P> gives the length in bytes of the
following property section. <L> gives the body length of the entire
Revision record. These two numbers will be *identical* for a Revision
record; the Content-length header is added for the benefit of software
that can parse RFC-822 messages.
A revision record is followed by one or more Node records (see below).
==== Node records ====
Each Revision record is followed by one or more Node records.
Node records have the following sequence of header lines:
-------------------------------------------------------------------
Node-path: <path/to/node/in/filesystem>
[Node-kind: {file | dir}]
Node-action: {change | add | delete | replace}
[Node-copyfrom-rev: <rev>]
[Node-copyfrom-path: <path>]
[Text-copy-source-md5: <blob>]
[Text-copy-source-sha1: <blob>]
[Text-content-md5: <blob>]
[Text-content-sha1: <blob>]
[Text-content-length: <T>]
[Prop-content-length: <P>]
[Content-length: <Y>]
!
-------------------------------------------------------------------
Bracketing in [] indicates optional lines; { | } is an alternation group.
Dump decoders should be prepared for the optional lines after
Node-action to be in any order, except that Content-length is
always last if it is present.
A Node record describes an action on a path relative to the repository
root, and always begins with the Node-path specification.
The Node-kind line indicates whether the path is a file or directory.
The header value will be one of the strings "file" or "dir".
This header may be (and usually is) absent if the node action is a delete.
The Node-action line is always present and specifies the type of
operation for this node. The header value is one of the strings
"change", "add", "delete", or "replace". These operations will be
described in detail later in this document.
Either both the Node-copyfrom-rev and Node-copyfrom-path lines will be
present, or neither will be. They pair to describe a copy source for
the node. Copy-source semantics will be described in detail later in
this document.
The Text-content-{md5,sha1} and Text-copy-source-{md5,sha1} lines are
hash integrity checks and will be present only if Text-content-length
and the copyfrom pair (respectively) are also present. A decoder may
use them to verify that the source content they refer to has not been
corrupted.
Text-content-length will be present only when there is a text section.
Zero is a legal value for this length, indicating an empty file.
Prop-content-length will be present only when there is a properties section.
Content-length will be present if there is either a text or a
properties section. This is not always the case. In particular,
a delete operation cannot have either. Some other operations that use
copyfrom sources may also not have either.
Again, the '!' stands in for a mandatory empty line following the
RFC822-style headers. A body may follow.
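As a hedged illustration of how the length headers above carve up a Node record body, here is a small Python sketch. The function name `split_node_body` is hypothetical. It assumes the headers have already been parsed into a dictionary of strings, and that the property section precedes the text section, as in the example later in this document. The MD5 check is skipped for delta dumps, on the assumption that the checksum there describes the reconstructed text rather than the delta bytes.

```python
import hashlib

def split_node_body(headers: dict, body: bytes):
    """Split a Node record body into (props, text); each is None when absent."""
    p = headers.get("Prop-content-length")
    t = headers.get("Text-content-length")
    props = text = None
    offset = 0
    if p is not None:
        props = body[:int(p)]
        offset = int(p)
    if t is not None:
        text = body[offset:offset + int(t)]
        md5 = headers.get("Text-content-md5")
        is_delta = headers.get("Text-delta") == "true"
        # Assumption: only verify full-text bodies, not svndiff deltas.
        if md5 is not None and not is_delta:
            if hashlib.md5(text).hexdigest() != md5.lower():
                raise ValueError("Text-content-md5 mismatch")
    return props, text
```

Note that a Text-content-length of zero yields an empty byte string (an empty file), which is distinct from `None` (no text section at all).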
=== Property sections ===
A Revision record *may* have a property section, and a Node record *may*
have a property section. Every record with a property section has
a Prop-content-length header.
A property section consists of pairs of key and value records and
is ended by a fixed trailer. Here is an example attached to a
Revision record:
-------------------------------------------------------------------
Revision-number: 1422
Prop-content-length: 80
Content-length: 80
K 6
author
V 7
sussman
K 3
log
V 33
Added two files, changed a third.
PROPS-END
-------------------------------------------------------------------
The fixed trailer is "PROPS-END\n" and its length is included in the
Prop-content-length. Before it, each K and V record consists of a
header line giving the length of the key or value content in bytes.
The content follows. The content is itself always followed by \n.
In version 3 of the format, a third type 'D' of property record is
introduced to describe property deletion. This feature will be
described later, in the specification of delta dumps.
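The K/V layout above is length-prefixed, so a parser should read content by byte count rather than by scanning for newlines. The following Python sketch shows one way to do that; `parse_props` is a hypothetical name, and keys and values are kept as raw bytes because the format does not fix their encoding.

```python
def parse_props(section: bytes):
    """Parse a property section into (props, deleted_names).

    props maps key bytes to value bytes; deleted_names lists the names
    carried by 'D' records (version 3 only).
    """
    props, deleted = {}, []
    key = None
    pos = 0
    while True:
        end = section.index(b"\n", pos)
        line = section[pos:end]
        pos = end + 1
        if line == b"PROPS-END":
            return props, deleted
        kind, _, length = line.partition(b" ")
        n = int(length.decode("ascii"))
        content = section[pos:pos + n]
        pos += n + 1  # skip the content and the \n that always follows it
        if kind == b"K":
            key = content
        elif kind == b"V":
            props[key] = content
        elif kind == b"D":
            deleted.append(content)
        else:
            raise ValueError("unexpected property record: %r" % line)
```

Reading by length means a value may itself contain newlines, or even the text "PROPS-END", without confusing the parser.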
== Semantics ==
=== The kinds of things ===
There are four kinds of things described by a dumpfile: paths,
properties, content, and flows. The distinctions among content,
paths, and flows matter for understanding some operations.
A path is a filesystem location (a file or directory). There are two
kinds of paths in a dumpfile: node paths and copy sources.
Properties are key-value pairs associated with revisions or paths.
Subversion interprets and reserves some properties, those beginning
with "svn:". Others are not interpreted by Subversion; they may
be set and read for the convenience of other applications, such
as repository browsers or translators.
A flow is a sequence of actions on a file or directory path that is
considered to be a single history for change-tracking purposes.
Creating a flow tells Subversion that you want to track the history of
the path or paths it contains. Destroying a flow breaks the chain of
history; changes will not be tracked across the break, even if another
flow is created at the same path. A copy operation creates a new
flow connected to the flow from which it was copied.
Content is what file paths point at (one timewise slice of a flow). It
is the payload of program source code, documents, images, and so forth
that a version control system actually manages.
A Node record describes a change in properties, the addition or deletion
of a flow, or a change in content. It must do at least one of these things,
otherwise it would be a no-op and omitted.
When no copyfrom is present, and the action isn't an add or copy, then
the kind of the thing identified by (PATH, REVISION) must agree with
the kind of the thing identified by (PATH, -1+REVISION).
Terminological note: in Subversion-speak, the term "node" is
historically ambiguous. Sometimes it refers to what this document
calls a "flow", and sometimes it refers to the internal per-revision
structure that a Node record represents (that is, just one action in a
flow). For clarity, most of this document avoids the term "node" in
favor of the more specific "flow" and "Node record", but knowing
about this issue will help if you read the Ancient History section.
=== The kinds of operations ===
.File operations
|======================================================================
| | add | delete | replace | change |
|Can have text section? | optional | no | optional | optional |
|Can have property section? | optional | no | optional | optional |
|Can have copy source? | optional | no | optional | no |
|Fails on existent path | yes* | no | no | no |
|Fails on non-existent path | no | yes | yes | yes |
|======================================================================
FIXME: As of December 2011 there is a minor bug: Adding a file with history
twice _in two different revisions_ succeeds silently.
.Directory operations
|======================================================================
| | add | delete | replace | change |
|Can have text section? | no | no | no | no |
|Can have property section? | optional | no | optional | required |
|Can have copy source? | optional | no | optional | no |
|Fails on existent path | yes | no | no | no |
|Fails on non-existent path | no | yes | yes | yes |
|======================================================================
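The two tables above can be encoded directly as data. The following Python sketch is one hypothetical way to check a Node record against them; the names `FILE_RULES`, `DIR_RULES` and `check_node` are illustrative only. It assumes the caller already knows the kind of the path (for a delete, where Node-kind is usually absent, that means consulting its own path map), and it does not model the "add with history" bug noted in the FIXME.

```python
# Each tuple is (text section, property section, copy source).
FILE_RULES = {
    "add":     ("optional", "optional", "optional"),
    "delete":  ("no",       "no",       "no"),
    "replace": ("optional", "optional", "optional"),
    "change":  ("optional", "optional", "no"),
}
DIR_RULES = {
    "add":     ("no", "optional", "optional"),
    "delete":  ("no", "no",       "no"),
    "replace": ("no", "optional", "optional"),
    "change":  ("no", "required", "no"),
}

def check_node(kind, action, has_text, has_props, has_copy, path_exists):
    """Return a list of rule violations (empty if the record is acceptable)."""
    rules = (FILE_RULES if kind == "file" else DIR_RULES)[action]
    names = ("text section", "property section", "copy source")
    present = (has_text, has_props, has_copy)
    problems = []
    for name, rule, here in zip(names, rules, present):
        if rule == "no" and here:
            problems.append(f"{kind} {action} cannot have a {name}")
        elif rule == "required" and not here:
            problems.append(f"{kind} {action} requires a {name}")
    if action == "add" and path_exists:
        problems.append("add fails on an existing path")
    elif action != "add" and not path_exists:
        problems.append(f"{action} fails on a non-existent path")
    return problems
```

Returning a list rather than raising lets an interpreter decide for itself whether a violation is fatal or merely worth a warning.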
A Node record represents an operation that does one of four things: add,
delete, change, or replace.
Node records can carry content in one (or both!) of two ways: from a text
section or from a copy source (that is, a copy-path and copy-revision
pair).
Giving a copy source appends the node to the flow of which that source
is part; when you 'add' or 'replace' with a copy source, the content
at the path becomes a copy of the source (but see below for a
qualification about directories).
Giving a text section also changes the content of the flow. In the
(unusual) case that a node has both a copy source and a text section,
the correct semantics is to attach the path to the source flow and
then change the content.
An add operation creates a new flow for a file or directory. See the
table above for possible operand combinations.
A delete operation deletes a flow and its content. If the path is a
file, the file is deleted. If the path is a directory, the directory
and all its children are deleted. A subsequent add at the same path
will create a new and different flow with its own history.
A change operation changes properties on a file or directory path. See the
table above for possible operand combinations.
A replace operation behaves exactly like a delete followed by an add
(destroying an old flow, producing a new one) when it has no copy
source. When a replace has a copy source, it produces a new flow
with history extending back through the copy source. A Node record
representing a replace operation may have a property section.
The main reason "replace" exists is that it helps sequential
processors of the dump stream avoid possibly notifying about multiple
actions on the same path.
It is even possible to have a replace with a copyfrom source *and*
text, such as would result from this on the client side:
-------------------------------------------------------------------
$ svn rm dir/file.txt
$ svn cp otherdir/otherfile.txt dir/file.txt
$ echo "Replacement text" > dir/file.txt
$ svn ci -m "Replace dir/file.txt with a copy of otherdir/otherfile.txt and replace its text, too."
-------------------------------------------------------------------
Subversion filesystems do not allow the root directory ("/") to be
deleted or replaced.
=== Some details about copyfroms ===
The source and target of a copyfrom are always of like kind; that is,
Subversion dump will never generate a node with a source type of file
and a target type of directory or vice-versa.
Interpreting copyfrom_path for file copies is straightforward; the
target pathname gets the contents of the source pathname.
Directory copies (the primitive beneath branching and tagging) are
tricky. For each source path under the source directory, a new path
is generated by removing the head segment of the pathname that is
the source directory. That new path under the target directory gets
the content of the source path.
After this operation:
-------------------------------------------------------------------
Node-path: x/y/z
Node-kind: dir
Node-action: add
Node-copyfrom-rev: 10
Node-copyfrom-path: a/b/c
-------------------------------------------------------------------
the file a/b/c/d will have been copied to x/y/z/d.
A single revision may include multiple copyfrom Node records, even multiple
copyfroms to the same directory, even mixed directory and file copies
to the same directory.
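The directory-copy rule described in this section amounts to a prefix rewrite over the source revision's path map. Here is a minimal Python sketch; the function name `copied_paths` is hypothetical, and it assumes the interpreter keeps each revision's paths as a dictionary from path string to content.

```python
def copied_paths(source_paths: dict, source_dir: str, target_dir: str) -> dict:
    """Return the path -> content entries created by a directory copy.

    source_paths is the path map of the copyfrom revision.
    """
    prefix = source_dir + "/"
    result = {}
    for path, content in source_paths.items():
        if path.startswith(prefix):
            # Remove the source directory from the head of the path and
            # re-root the remainder under the target directory.
            result[target_dir + "/" + path[len(prefix):]] = content
    return result
```

Matching on `source_dir + "/"` rather than on `source_dir` alone keeps a sibling such as a/b/cc from being swept up in a copy of a/b/c.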
=== Properties and persistence ===
The properties section of a Revision record consists of some (possibly
empty) subset of the three reserved revision properties: svn:author,
svn:date, and svn:log, along with any other revision properties.
The revision properties do not persist to later revisions. Each revision
has exactly the revision properties specified in its revision record, or
no revision properties if there is no property section.
The key thing to know about Node properties is that they are
persistent, once set, until modified by a future property
section on the same path.
Normally, a dumpfile re-lists the entire property set for a directory
or file in every Node record that changes any part of it. (But see
the material on delta dumps for an exception.)
This implies that to delete a given property from a path, a dumpfile
generator will issue a Node record with all other properties listed in it;
to delete all properties from a path, the dumpfile generator will
simply issue a node with an empty properties section. Note that this
is different from an *absent* properties section, which will change
no properties and will be associated with a change to content!
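The difference between an absent and an empty property section is easy to get wrong, so here is a small Python sketch of the non-delta rule. The function name `apply_prop_section` is hypothetical, and an absent section is represented as `None`.

```python
def apply_prop_section(current: dict, section):
    """Return the properties of a path after a non-delta Node record.

    section is None when the record has no property section, otherwise
    a dict holding the complete new property set (possibly empty).
    """
    if section is None:
        # Absent section: properties persist unchanged.
        return dict(current)
    # Present section: it replaces the whole set; {} clears everything.
    return dict(section)
```

For example, starting from `{"svn:ignore": "TAGS"}`, passing `None` keeps that property, while passing `{}` leaves the path with no properties.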
=== Representation of symbolic links ===
When the Subversion client sends a content blob representing a
symbolic link (that is, with the svn:special property) the contents of
the blob is not just the link's target path. It will have the prefix
"link ". The client likewise interprets this prefix at checkout time.
In the future, other special blob formats with other prefix keywords may
be defined. None such yet exist as of revision 1441992 (February 2013).
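A translator that wants to recreate symbolic links can strip the prefix itself. The following Python sketch assumes the caller has already confirmed that the path carries the svn:special property; the function name `symlink_target` is hypothetical.

```python
def symlink_target(content: bytes):
    """Return the link target from an svn:special blob, or None.

    Only call this for content whose path has the svn:special property.
    """
    prefix = b"link "
    if content.startswith(prefix):
        return content[len(prefix):]
    # Some other (future) special format; leave it to the caller.
    return None
```

Returning `None` for an unrecognized prefix leaves room for the other special blob formats that may be defined later.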
=== Implementation pragmatics ===
Because directory operations with copyfroms don't specify all the file
paths they modify, an interpreter for this format must build a map of
the paths in the file store it is manipulating, and update that map as
it processes each Node record.
On a repository with thousands of commits, the per-revision list of
maps can become quite large. For space economy, the file map for each
revision can be discarded after it is processed *unless it is a source
revision for a copyfrom*.
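One way to realize this space economy is sketched below in Python. It assumes the interpreter can make a first pass over the Node record headers to learn which revisions are ever used as copy sources; both function names are hypothetical.

```python
def copyfrom_revisions(node_headers):
    """Collect every revision named by a Node-copyfrom-rev header.

    node_headers is an iterable of parsed header dicts, one per Node record.
    """
    return {
        int(h["Node-copyfrom-rev"])
        for h in node_headers
        if "Node-copyfrom-rev" in h
    }

def discard_unneeded(maps: dict, needed: set, current: int) -> None:
    """Drop per-revision path maps older than current that no copy needs."""
    for rev in [r for r in maps if r < current and r not in needed]:
        del maps[rev]
```

An interpreter that cannot afford two passes would have to keep every map, or rely on some persistent store, since a later record may name any earlier revision as a copy source.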
== An example ==
Here's an example of revision 1422, which added a new directory
"baz", added a new file "bop" inside it, and modified the file "foo.c":
-------------------------------------------------------------------
Revision-number: 1422
Prop-content-length: 80
Content-length: 80
K 6
author
V 7
sussman
K 3
log
V 33
Added two files, changed a third.
PROPS-END
Node-path: bar/baz
Node-kind: dir
Node-action: add
Prop-content-length: 35
Content-length: 35
K 10
svn:ignore
V 4
TAGS
PROPS-END
Node-path: bar/baz/bop
Node-kind: file
Node-action: add
Prop-content-length: 76
Text-content-length: 54
Content-length: 130
K 14
svn:executable
V 2
on
K 12
svn:keywords
V 15
LastChangedDate
PROPS-END
Here is the text of the newly added 'bop' file.
Whee.
Node-path: bar/foo.c
Node-kind: file
Node-action: change
Text-content-length: 102
Content-length: 102
Here is the fulltext of my change to an existing /bar/foo.c.
Notice that this file has no properties.
-------------------------------------------------------------------
== Format variants ==
=== Version 3 format ===
Version 3 format is a delta dump; text changes are represented
as diffs against the original file, and properties as incremental
changes to a persistent set (that is, a property section does not
necessarily implicitly clear the property set on a path before the
new property settings are evaluated).
This change is a space optimization. It requires additional
computing time to integrate the diff history.
Version 3 is generated by SVN versions 1.1.0-present, if requested by the user.
This format is equivalent to the VERSION 2 format except for the
following:
1. The format starts with the new version number of the dump format
("SVN-fs-dump-format-version: 3\n").
2. There are several new optional headers for Node records:
-------------------------------------------------------------------
[Text-delta: true|false]
[Prop-delta: true|false]
[Text-delta-base-md5: blob]
[Text-delta-base-sha1: blob]
[Text-copy-source-sha1: blob]
[Text-content-sha1: blob]
-------------------------------------------------------------------
The default value for the boolean headers is "false". If the value is
set to "true", then the text and property contents will be treated
as deltas against the previous contents of the flow (as determined
by copy history for adds with history, or by the value in the
previous revision for changes--just as with commits).
Property deltas have the same format as regular property lists except
that (1) properties with the same value as in the previous contents of
the flow are not printed, and (2) deleted properties will be written
out as
-------------------------------------------------------------------
D <name length>
<name>
-------------------------------------------------------------------
just as a regular property is printed, but with the "K " changed to a
"D " and with no value part.
Text deltas are written out as a series of svndiff0 windows. If
Text-delta-base-md5 is provided, it is the checksum of the base to
which the text delta is applied; note that older versions (pre-1.5) of
'svnadmin load' may ignore the checksum.
Text-delta-base-sha1, Text-copy-source-sha1, and Text-content-sha1 are not
currently used by the loader. They are written by 1.6-and-later versions of
Subversion so that future loaders can optionally choose which checksum to
use for checking for corruption.
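The property-delta rule described in this section can be sketched in Python as follows. The function name `apply_prop_delta` is hypothetical. It assumes the K/V pairs and the D names of the property section have already been parsed into a dictionary and a list respectively, and that it is called only for records carrying "Prop-delta: true".

```python
def apply_prop_delta(previous: dict, changed: dict, deleted: list) -> dict:
    """Apply a version 3 property delta to the previous property set."""
    result = dict(previous)   # unlisted properties persist unchanged
    result.update(changed)    # K/V pairs add or overwrite properties
    for name in deleted:
        result.pop(name, None)  # D records remove properties
    return result
```

Contrast this with a non-delta record, where the property section replaces the whole set rather than amending it.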
=== Archaic version 1 format ===
There are actually two types of version 1 dump streams. The regular ones
are generated since r2634 (svn 0.14.0). Older ones also claim to be
version 1, but miss the Prop-content-length and Text-content-length
fields in the block header. In those days there *always* was a
properties block.
This note is included for historical completeness only, as it is highly
unlikely that any Subversion instances that old remain in production.
== Implementation choices for optional behaviour ==
This section lists some of the ways existing implementations interpret the
optional aspects of the specification.
When a Revision record has no revision properties, svnadmin and svnrdump
write an empty properties section whereas svndumpfilter omits the properties
section. (At least in Subversion 1.0 through 1.8.)
== Ancient history ==
Old discussion:
(This file started as a proposal, preserved here for posterity.)
A proposal for an svn filesystem dump/restore format.
=== Two problems we want to solve ===
1. When we change our node-id schema, we need to migrate all of our
data (by dumping and restoring).
2. Serves as a backup format. Could be read by other software tools
someday.
=== Design Goals ===
A. Written as two new public functions in svn_fs.h. To be invoked
by new 'svnadmin' subcommands.
B. Format uses only timeless fs concepts.
The dump format needs to reference concepts that we *know* are
general enough to never change. These concepts must exist
independently of any internal node-id schema, or any DB storage
backend. In other words, we're talking about the basic ideas in
our original "design spec" from May 2000.
=== Format Semantics ===
Here are the timeless semantics of our fs design -- the things that
would be stored in our dump format.
- A filesystem is an array of trees.
Each tree is called a "revision" and has unversioned properties attached.
- A revision has a tree of "nodes" hanging off of it.
Actually, the nodes in the filesystem form a DAG. A revision
always points to an initial node that represents the 'root' of some tree.
- The majority of a tree's nodes are hard-links (references) to
nodes that were created in earlier trees.
- A node contains
- versioned text
- versioned properties
- predecessor history: "which node am I a variant of?"
- copy history: "which node am I a copy of?"
The history values can be non-existent (meaning the node is
completely new), or can have a value of {revision, path}.
=== Refinement of proposal #2: ===
(after discussion with gstein)
Each node starts with RFC822-style headers at the top. The final
header is a 'Content-length:', followed by the content, so record
boundaries can be inferred.
The content section has two implicit parts: a property hash, and the
fulltext. The division between these two sections is implied by the
"PROPS-END\n" tag at the end of the prophash. In the case of a
directory node or a revision, only the prophash is present.
//End of document.