
 = How to interpret Subversion dumpfiles =

 Version 1.1, 2013-02-02

 == Introduction ==

 The Subversion dumpfile format is a serialized description of the
 actions required to (re)build a version history from scratch.

 The goal of this document is that it be sufficient for people writing
 dumpfile interpreters to emulate the actions the dumpfile describes on
 a versioned filesystem-like store, such as another version-control
 system.  It derives from and incorporates some incomplete notes from
 before r39883.

 === Unresolved questions ===

 1. In interpreting a Node record which has both a copyfrom source and
 a property section, it is possible that the copy source node itself
 has a property section.  How are they to be combined?

 2. The section on the semantics of kinds of operations documents a
 minor bug at r39883 in the behavior of "add".  Has this been fixed?

 Portions of text relevant to these questions are tagged with FIXME.

 == Syntax ==

 === Encoding and delimiters ===

 Subversion dumpfiles are plain byte streams. The structural parts are
 ASCII.  Text sections and property key/value pairs may be interpreted
 as binary data in any encoding by client tools.

 A dumpfile consists of four kinds of records.  A record is a group of
 RFC822-style header lines (each consisting of a key, followed by a
 colon, followed by text data to end of line), followed by an empty
 spacer line, followed optionally by a body section.  If the body
 section is present, another empty spacer line separates it from the
 following record.

 For forward compatibility, unrecognized headers are ignored.
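
 As a minimal sketch of the record layout described above, the
 following Python function reads one header block up to its spacer
 line.  The names read_headers and KNOWN are illustrative assumptions,
 not part of Subversion; the sketch also assumes header values are
 UTF-8 and that a single ": " separates key from value.

```python
import io

def read_headers(stream):
    """Read one RFC822-style header block; return a dict, or None at end of input."""
    headers = {}
    while True:
        line = stream.readline()
        if not line:                       # end of input
            return headers or None
        if line == b"\n":
            if headers:                    # the spacer line ends the block
                return headers
            continue                       # skip blank lines between records
        key, _, value = line.rstrip(b"\n").partition(b": ")
        headers[key.decode("ascii")] = value.decode("utf-8")

# Hypothetical set of headers this reader understands.
KNOWN = {"Revision-number", "Prop-content-length", "Content-length"}

sample = io.BytesIO(
    b"Revision-number: 1422\nX-Future-Header: whatever\nContent-length: 80\n\n"
)
headers = read_headers(sample)
# Unrecognized headers are simply dropped rather than treated as errors.
recognized = {k: v for k, v in headers.items() if k in KNOWN}
```

 Filtering after parsing, rather than inside the reader, keeps the
 reader usable for every record type.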

 === Record types ===

 Dumpfiles include four record types.  Two, the version stamp and UUID
 record, consist of single header lines. The bulk of a dumpfile
 consists of Revision and Node records.

 ==== Version stamp records ====

 A version stamp record is always the first line of the file and
 looks like this:

 -------------------------------------------------------------------
 SVN-fs-dump-format-version: <N>\n
 -------------------------------------------------------------------

 where <N> is replaced by the dump format version. Except where
 specified, the descriptions in this document apply to all
 versions of the format.

 ==== UUID records ====

 Versions 2 and later may have a UUID record following the version
 stamp. It is of the form

 -------------------------------------------------------------------
 UUID: <hex-string>
 -------------------------------------------------------------------

 where the <hex-string> is the UUID of the originating repository.
 An example UUID is "7bf7a5ef-cabf-0310-b7d4-93df341afa7e".

 ==== Revision records ====

 A Revision record has up to three headers and is usually followed by a
 property section.  Expect the following form and sequence:

 -------------------------------------------------------------------
 Revision-number: <N>
 [Prop-content-length: <P>]
 Content-length: <L>
 !
 -------------------------------------------------------------------

 with the Revision-number header always first and the '!' indicating
 a mandatory empty spacer line.  <P> gives the length in bytes of the
 following property section. <L> gives the body length of the entire
 Revision record.  These two numbers will be *identical* for a Revision
 record; the Content-length header is added for the benefit of software
 that can parse RFC-822 messages.

 A revision record is followed by one or more Node records (see below).

 ==== Node records ====

 Each Revision record is followed by one or more Node records.
 Node records have the following sequence of header lines:

 -------------------------------------------------------------------
 Node-path: <path/to/node/in/filesystem>
 [Node-kind: {file | dir}]
 Node-action: {change | add | delete | replace}
 [Node-copyfrom-rev: <rev>]
 [Node-copyfrom-path: <path> ]
 [Text-copy-source-md5: <blob>]
 [Text-copy-source-sha1: <blob>]
 [Text-content-md5: <blob>]
 [Text-content-sha1: <blob>]
 [Text-content-length: <T>]
 [Prop-content-length: <P>]
 [Content-length: <Y>]
 !
 -------------------------------------------------------------------

 Bracketing in [] indicates optional lines; { | } is an alternation group.

 Dump decoders should be prepared for the optional lines after
 Node-action to be in any order, except that Content-length is
 always last if it is present.

 A Node record describes an action on a path relative to the repository
 root, and always begins with the Node-path specification.

 The Node-kind line indicates whether the path is a file or directory.
 The header value will be one of the strings "file" or "dir".
 This header may be (and usually is) absent if the node action is a delete.

 The Node-action line is always present and specifies the type of
 operation for this node.  The header value is one of the strings
 "change", "add", "delete", or "replace".  These operations will be
 described in detail later in this document.

 Either both the Node-copyfrom-rev and Node-copyfrom-path lines will be
 present, or neither will be.  They pair to describe a copy source for
 the node. Copy-source semantics will be described in detail later in
 this document.

 The Text-content-{md5,sha1} and Text-copy-source-{md5,sha1} lines are
 hash integrity checks and will be present only if Text-content-length
 and the copyfrom pair (respectively) are also present. A decoder may
 use them to verify that the source content they refer to has not been
 corrupted.
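
 A decoder might perform the Text-content check along these lines.
 This is only a sketch: the function name text_matches is hypothetical,
 and it assumes the <blob> values are hexadecimal digests, which this
 document does not state explicitly.

```python
import hashlib

def text_matches(headers, text):
    """Return True when every Text-content checksum present in headers matches text."""
    checks = (("Text-content-md5", "md5"), ("Text-content-sha1", "sha1"))
    for header, algorithm in checks:
        expected = headers.get(header)
        if expected is None:
            continue                       # checksum header absent: nothing to verify
        actual = hashlib.new(algorithm, text).hexdigest()
        if actual != expected.lower():     # assumption: digests are hex strings
            return False
    return True

text = b"Whee.\n"
headers = {"Text-content-md5": hashlib.md5(text).hexdigest()}
ok = text_matches(headers, text)
```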

 Text-content-length will be present only when there is a text section.
 Zero is a legal value for this length, indicating an empty file.

 Prop-content-length will be present only when there is a properties section.

 Content-length will be present if there is either a text or a
 properties section.  This is not always the case.  In particular,
 a delete operation cannot have either.  Some other operations that use
 copyfrom sources may also not have either.

 Again, the '!' stands in for a mandatory empty line following the
 RFC822-style headers. A body may follow.
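
 The length headers are enough to split a record body into its
 property and text sections.  The sketch below assumes that
 Content-length equals the sum of Prop-content-length and
 Text-content-length (as it does in the example later in this
 document) and that the property section comes first; split_body is a
 hypothetical helper name.

```python
def split_body(headers, body):
    """Split a record body into (props, text); an absent section is returned as None."""
    prop_len = int(headers.get("Prop-content-length", 0))
    text_len = int(headers.get("Text-content-length", 0))
    total = int(headers.get("Content-length", 0))
    if total != prop_len + text_len or total != len(body):
        raise ValueError("length headers do not agree with the body")
    props = body[:prop_len] if "Prop-content-length" in headers else None
    text = body[prop_len:] if "Text-content-length" in headers else None
    return props, text

props_bytes = (b"K 14\nsvn:executable\nV 2\non\n"
               b"K 12\nsvn:keywords\nV 15\nLastChangedDate\n"
               b"PROPS-END\n")
text_bytes = b"Here is the text of the newly added 'bop' file.\nWhee.\n"
headers = {"Prop-content-length": "76",
           "Text-content-length": "54",
           "Content-length": "130"}
props, text = split_body(headers, props_bytes + text_bytes)
```

 Returning None rather than empty bytes preserves the distinction
 between an empty section and an absent one.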

 === Property sections ===

 A Revision record *may* have a property section, and a Node record *may*
 have a property section. Every record with a property section has
 a Prop-content-length header.

 A property section consists of pairs of key and value records and
 is ended by a fixed trailer.  Here is an example attached to a
 Revision record:

 -------------------------------------------------------------------
 Revision-number: 1422
 Prop-content-length: 80
 Content-length: 80

 K 6
 author
 V 7
 sussman
 K 3
 log
 V 33
 Added two files, changed a third.
 PROPS-END
 -------------------------------------------------------------------

 The fixed trailer is "PROPS-END\n" and its length is included in the
 Prop-content-length. Before it, each K and V record consists of a
 header line giving the length of the key or value content in bytes.
 The content follows.  The content is itself always followed by \n.

 In version 3 of the format, a third type 'D' of property record is
 introduced to describe property deletion. This feature will be
 described later, in the specification of delta dumps.
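
 A property section can be decoded by following the length prefixes
 rather than by splitting on newlines, since values may themselves
 contain newlines.  The following is a minimal sketch; parse_props and
 its ("set", ...) / ("delete", ...) tuple shape are illustrative
 choices, not an existing API.

```python
def parse_props(section):
    """Parse a property section into ("set", key, value) or ("delete", key, None) tuples."""
    entries = []
    pos = 0

    def read_item(tag):
        # Read "<tag> <length>\n<content>\n" at pos and return the content bytes.
        nonlocal pos
        eol = section.index(b"\n", pos)
        found, length = section[pos:eol].split(b" ")
        if found != tag:
            raise ValueError("expected %r record" % tag)
        start = eol + 1
        end = start + int(length)
        if section[end:end + 1] != b"\n":
            raise ValueError("content not followed by newline")
        pos = end + 1
        return section[start:end]

    while not section.startswith(b"PROPS-END\n", pos):
        tag = section[pos:pos + 1]
        if tag == b"K":
            key = read_item(b"K")
            entries.append(("set", key, read_item(b"V")))
        elif tag == b"D":                  # version 3 property deletion
            entries.append(("delete", read_item(b"D"), None))
        else:
            raise ValueError("malformed property section")
    return entries

example = (b"K 6\nauthor\nV 7\nsussman\n"
           b"K 3\nlog\nV 33\nAdded two files, changed a third.\n"
           b"PROPS-END\n")
props = parse_props(example)
```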

 == Semantics ==

 === The kinds of things ===

 There are four kinds of things described by a dumpfile: paths,
 properties, content, and flows.  The distinctions among content,
 paths, and flows matter for understanding some operations.

 A path is a filesystem location (a file or directory).  There are two
 kinds of paths in a dumpfile: node paths and copy sources.

 Properties are key-value pairs associated with revisions or paths.
 Subversion interprets and reserves some properties, those beginning
 with "svn:". Others are not interpreted by Subversion; they may
 be set and read for the convenience of other applications, such
 as repository browsers or translators.

 A flow is a sequence of actions on a file or directory path that is
 considered to be a single history for change-tracking purposes.
 Creating a flow tells Subversion that you want to track the history of
 the path or paths it contains. Destroying a flow breaks the chain of
 history; changes will not be tracked across the break, even if another
 flow is created at the same path.  A copy operation creates a new
 flow connected to the flow from which it was copied.

 Content is what file paths point at (one timewise slice of a flow). It
 is the payload of program source code, documents, images, and so forth
 that a version control system actually manages.

 A Node record describes a change in properties, the addition or deletion
 of a flow, or a change in content.  It must do at least one of these things,
 otherwise it would be a no-op and omitted.

 When no copyfrom is present, and the action isn't an add or copy, then
 the kind of the thing identified by (PATH, REVISION) must agree with
 the kind of the thing identified by (PATH, REVISION-1).

 Terminological note: in Subversion-speak, the term "node" is
 historically ambiguous.  Sometimes it refers to what this document
 calls a "flow", and sometimes it refers to the internal per-revision
 structure that a Node record represents (that is, just one action in a
 flow).  For clarity, most of this document avoids the term "node" in
 favor of the more specific "flow" and "Node record", but knowing
 about this issue will help if you read the Ancient History section.

 === The kinds of operations ===

 .File operations
 |======================================================================
 |                           |   add    | delete | replace  |  change  |
 |Can have text section?     | optional |   no   | optional | optional |
 |Can have property section? | optional |   no   | optional | optional |
 |Can have copy source?      | optional |   no   | optional |    no    |
 |Fails on existent path     |   yes*   |   no   |    no    |    no    |
 |Fails on non-existent path |    no    |  yes   |   yes    |   yes    |
 |======================================================================

 FIXME: As of December 2011 there is a minor bug: Adding a file with history
 twice _in two different revisions_ succeeds silently.

 .Directory operations
 |======================================================================
 |                           |   add    | delete | replace  |  change  |
 |Can have text section?     |    no    |   no   |    no    |    no    |
 |Can have property section? | optional |   no   | optional | required |
 |Can have copy source?      | optional |   no   | optional |    no    |
 |Fails on existent path     |   yes    |   no   |    no    |    no    |
 |Fails on non-existent path |    no    |  yes   |   yes    |   yes    |
 |======================================================================

 A Node record represents an operation that does one of four things: add,
 delete, change, or replace.

 Node records can carry content in one (or both!) of two ways: from a text
 section or from a copy source (that is, a copy-path and copy-revision
 pair).

 Giving a copy source appends the node to the flow of which that source
 is part; when you 'add' or 'replace' with a copy source, the content
 at the path becomes a copy of the source (but see below for a
 qualification about directories).

 Giving a text section also changes the content of the flow. In the
 (unusual) case that a node has both a copy source and a text section,
 the correct semantics is to attach the path to the source flow and
 then change the content.

 An add operation creates a new flow for a file or directory. See the
 table above for possible operand combinations.

 A delete operation deletes a flow and its content. If the path is a
 file, the file is deleted.  If the path is a directory, the directory
 and all its children are deleted. A subsequent add at the same path
 will create a new and different flow with its own history.

 A change operation changes the text or properties of an existing file,
 or the properties of an existing directory. See the table above for
 possible operand combinations.

 A replace operation behaves exactly like a delete followed by an add
 (destroying an old flow, producing a new one) when it has no copy
 source. When a replace has a copy source, it produces a new flow
 with history extending back through the copy source. A Node record
 representing a replace operation may have a property section.

 The main reason "replace" exists is that it helps sequential
 processors of the dump stream avoid possibly notifying about multiple
 actions on the same path.

 It is even possible to have a replace with a copyfrom source *and*
 text, such as would result from this on the client side:

 -------------------------------------------------------------------
 $ svn rm dir/file.txt
 $ svn cp otherdir/otherfile.txt dir/file.txt
 $ echo "Replacement text" > dir/file.txt
 $ svn ci -m "Replace dir/file.txt with a copy of otherdir/otherfile.txt and replace its text, too."
 -------------------------------------------------------------------

 Subversion filesystems do not allow the root directory ("/") to be
 deleted or replaced.
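
 The existence rules in the tables above can be sketched as a small
 state update on a map of live paths.  This is an illustration under
 stated assumptions, not Subversion's implementation: apply_action is
 a hypothetical name, `paths` is assumed to map each path to "file" or
 "dir", and copy sources, kind checks, and the root-directory rule are
 left out.

```python
def apply_action(paths, path, action, kind=None):
    """Apply one Node-action to paths, a dict mapping path -> "file" or "dir"."""
    exists = path in paths
    if action == "add":
        if exists:
            raise ValueError("add fails on an existing path: " + path)
    elif action in ("delete", "replace", "change"):
        if not exists:
            raise ValueError(action + " fails on a non-existent path: " + path)
    else:
        raise ValueError("unknown Node-action: " + action)

    if action in ("delete", "replace"):
        # Deleting a directory also deletes everything beneath it.
        prefix = path + "/"
        for doomed in [p for p in paths if p == path or p.startswith(prefix)]:
            del paths[doomed]
    if action in ("add", "replace"):
        paths[path] = kind                 # starts a new flow at this path
    return paths

paths = {"bar": "dir", "bar/foo.c": "file"}
apply_action(paths, "bar/baz", "add", "dir")
apply_action(paths, "bar/baz/bop", "add", "file")
apply_action(paths, "bar/baz", "delete")
```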

 === Some details about copyfroms ===

 The source and target of a copyfrom are always of like kind; that is,
 Subversion dump will never generate a node with a source type of file
 and a target type of directory or vice-versa.

 Interpreting Node-copyfrom-path for file copies is straightforward: the
 target pathname gets the contents of the source pathname.

 Directory copies (the primitive beneath branching and tagging) are
 tricky.  For each source path under the source directory, a new path
 is generated by removing the head segment of the pathname that is
 the source directory.  That new path under the target directory gets
 the content of the source path.

 After this operation:

 -------------------------------------------------------------------
 Node-path: x/y/z
 Node-kind: dir
 Node-action: add
 Node-copyfrom-rev: 10
 Node-copyfrom-path: a/b/c
 -------------------------------------------------------------------

 the file a/b/c/d will have been copied to x/y/z/d.

 A single revision may include multiple copyfrom Node records, even multiple
 copyfroms to the same directory, even mixed directory and file copies
 to the same directory.
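
 The path rewriting performed by a directory copy can be sketched as
 follows.  The function name copy_directory is hypothetical, and
 source_paths is assumed to be the list of paths that existed at the
 copyfrom revision.

```python
def copy_directory(source_paths, src_dir, dst_dir):
    """Map every path at or under src_dir to its counterpart under dst_dir."""
    prefix = src_dir + "/"
    mapping = {}
    for path in source_paths:
        if path == src_dir:
            mapping[path] = dst_dir
        elif path.startswith(prefix):
            # Strip the source directory from the head, then graft onto the target.
            mapping[path] = dst_dir + "/" + path[len(prefix):]
    return mapping

source_paths = ["a/b/c", "a/b/c/d", "a/b/c/e/f", "a/b/cx"]
mapping = copy_directory(source_paths, "a/b/c", "x/y/z")
```

 Matching on the prefix "a/b/c/" rather than "a/b/c" keeps the sibling
 path a/b/cx out of the copy.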

 === Properties and persistence ===

 The properties section of a Revision record consists of some (possibly
 empty) subset of the three reserved revision properties: svn:author,
 svn:date, and svn:log, along with any other revision properties.

 The revision properties do not persist to later revisions.  Each revision
 has exactly the revision properties specified in its revision record, or
 no revision properties if there is no property section.

 The key thing to know about Node properties is that they are
 persistent, once set, until modified by a future property
 section on the same path.

 Normally, a dumpfile re-lists the entire property set for a directory
 or file in every Node record that changes any part of it. (But see
 the material on delta dumps for an exception.)

 This implies that to delete a given property from a path, a dumpfile
 generator will issue a Node record with all other properties listed in it;
 to delete all properties from a path, the dumpfile generator will
 simply issue a node with an empty properties section. Note that this
 is different from an *absent* properties section, which will change
 no properties and will be associated with a change to content!
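
 For non-delta dumps, the difference between an absent and an empty
 property section can be captured in a few lines.  In this sketch,
 next_props is a hypothetical helper and `section` is assumed to be
 the record's parsed property section as a dict, or None when the
 record has no property section.

```python
def next_props(current, section):
    """Return a path's property set after one Node record (non-delta dump)."""
    if section is None:
        return dict(current)               # absent section: properties persist
    return dict(section)                   # present section: replaces the whole set

current = {"svn:executable": "on", "svn:keywords": "LastChangedDate"}
unchanged = next_props(current, None)
cleared = next_props(current, {})
one_removed = next_props(current, {"svn:keywords": "LastChangedDate"})
```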

 === Representation of symbolic links ===

 When the Subversion client sends a content blob representing a
 symbolic link (that is, with the svn:special property) the content of
 the blob is not just the link's target path. It will have the prefix
 "link ".  The client likewise interprets this prefix at checkout time.

 In the future, other special blob formats with other prefix keywords may
 be defined.  None such yet exist as of revision 1441992 (February 2013).
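
 An interpreter might recover a link target as sketched below.  The
 name symlink_target is hypothetical, the property value used in the
 example is illustrative only, and the sketch assumes the target is
 everything after the prefix.

```python
LINK_PREFIX = b"link "

def symlink_target(content, props):
    """Return the link target bytes if the blob is a symbolic link, else None."""
    if "svn:special" in props and content.startswith(LINK_PREFIX):
        return content[len(LINK_PREFIX):]
    return None

target = symlink_target(b"link ../lib/libfoo.so", {"svn:special": "*"})
ordinary = symlink_target(b"link text in an ordinary file", {})
```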

 === Implementation pragmatics ===

 Because directory operations with copyfroms don't specify all the file
 paths they modify, an interpreter for this format must build a map of
 the paths in the file store it is manipulating, and update that map as
 it processes each Node record.

 On a repository with thousands of commits, the per-revision list of
 maps can become quite large. For space economy, the file map for each
 revision can be discarded after it is processed *unless it is a source
 revision for a copyfrom*.
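
 One way to apply this economy is to scan the Node record headers
 first, collect every revision named as a copy source, and keep only
 those maps.  The sketch below assumes such a pre-scan is possible
 (that is, the dumpfile can be read twice); both function names are
 hypothetical.

```python
def copy_source_revisions(node_headers):
    """Collect the revision numbers named by Node-copyfrom-rev headers."""
    return {int(h["Node-copyfrom-rev"])
            for h in node_headers
            if "Node-copyfrom-rev" in h}

def prune_maps(maps_by_rev, keep, latest):
    """Drop per-revision path maps that are neither copy sources nor the latest revision."""
    return {rev: paths for rev, paths in maps_by_rev.items()
            if rev in keep or rev == latest}

node_headers = [
    {"Node-path": "x/y/z", "Node-action": "add",
     "Node-copyfrom-rev": "10", "Node-copyfrom-path": "a/b/c"},
    {"Node-path": "bar/foo.c", "Node-action": "change"},
]
keep = copy_source_revisions(node_headers)
maps_by_rev = {9: {}, 10: {"a/b/c": "dir"}, 11: {}, 12: {"x/y/z": "dir"}}
pruned = prune_maps(maps_by_rev, keep, latest=12)
```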

 == An example ==

 Here's an example of revision 1422, which added a new directory
 "baz", added a new file "bop" inside it, and modified the file "foo.c":

 -------------------------------------------------------------------
 Revision-number: 1422
 Prop-content-length: 80
 Content-length: 80

 K 6
 author
 V 7
 sussman
 K 3
 log
 V 33
 Added two files, changed a third.
 PROPS-END

 Node-path: bar/baz
 Node-kind: dir
 Node-action: add
 Prop-content-length: 35
 Content-length: 35

 K 10
 svn:ignore
 V 4
 TAGS
 PROPS-END


 Node-path: bar/baz/bop
 Node-kind: file
 Node-action: add
 Prop-content-length: 76
 Text-content-length: 54
 Content-length: 130

 K 14
 svn:executable
 V 2
 on
 K 12
 svn:keywords
 V 15
 LastChangedDate
 PROPS-END
 Here is the text of the newly added 'bop' file.
 Whee.

 Node-path: bar/foo.c
 Node-kind: file
 Node-action: change
 Text-content-length: 102
 Content-length: 102

 Here is the fulltext of my change to an existing /bar/foo.c.
 Notice that this file has no properties.
 -------------------------------------------------------------------

 == Format variants ==

 === Version 3 format ===

 Version 3 format is a delta dump; text changes are represented
 as diffs against the original file, and properties as incremental
 changes to a persistent set (that is, a property section does not
 necessarily implicitly clear the property set on a path before the
 new property settings are evaluated).

 This change is a space optimization. It requires additional
 computing time to integrate the diff history.

 Version 3 is generated by SVN versions 1.1.0-present, if requested by the user.

 This format is equivalent to the VERSION 2 format except for the
 following:

 1. The format starts with the new version number of the dump format
    ("SVN-fs-dump-format-version: 3\n").

 2. There are several new optional headers for Node records:

 -------------------------------------------------------------------
 [Text-delta: true|false]
 [Prop-delta: true|false]
 [Text-delta-base-md5: blob]
 [Text-delta-base-sha1: blob]
 [Text-copy-source-sha1: blob]
 [Text-content-sha1: blob]
 -------------------------------------------------------------------

 The default value for the boolean headers is "false".  If the value is
 set to "true", then the text and property contents will be treated
 as deltas against the previous contents of the flow (as determined
 by copy history for adds with history, or by the value in the
 previous revision for changes--just as with commits).

 Property deltas have the same format as regular property lists except
 that (1) properties with the same value as in the previous contents of
 the flow are not printed, and (2) deleted properties will be written
 out as

 -------------------------------------------------------------------
 D <name length>
 <name>
 -------------------------------------------------------------------

 just as a regular property is printed, but with the "K " changed to a
 "D " and with no value part.

 Text deltas are written out as a series of svndiff0 windows.  If
 Text-delta-base-md5 is provided, it is the checksum of the base to
 which the text delta is applied; note that older versions (pre-1.5) of
 'svnadmin load' may ignore the checksum.

 Text-delta-base-sha1, Text-copy-source-sha1, and Text-content-sha1 are not
 currently used by the loader.  They are written by 1.6-and-later versions of
 Subversion so that future loaders can optionally choose which checksum to
 use for checking for corruption.
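
 Applying a property delta, as described earlier in this section, can
 be sketched as an update of the previous property set.  The function
 name apply_prop_delta and the ("set", key, value) / ("delete", key,
 None) tuples are assumptions made for this illustration; text deltas
 (svndiff0) are not covered here.

```python
def apply_prop_delta(base, entries):
    """Return base updated by the K/V ("set") and D ("delete") entries of a delta."""
    props = dict(base)                     # unlisted properties keep their old values
    for kind, key, value in entries:
        if kind == "delete":
            props.pop(key, None)
        elif kind == "set":
            props[key] = value
        else:
            raise ValueError("unknown entry kind: " + kind)
    return props

base = {"svn:ignore": "TAGS", "svn:keywords": "Id"}
entries = [("delete", "svn:ignore", None),
           ("set", "svn:eol-style", "native")]
result = apply_prop_delta(base, entries)
```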

 === Archaic version 1 format ===

 There are actually two types of version 1 dump streams. The regular ones
 have been generated since r2634 (svn 0.14.0). Older ones also claim to be
 version 1, but lack the Prop-content-length and Text-content-length
 fields in the block header. In those days there *always* was a
 properties block.

 This note is included for historical completeness only, as it is highly
 unlikely that any Subversion instances that old remain in production.

 == Implementation choices for optional behaviour ==

 This section lists some of the ways existing implementations interpret the
 optional aspects of the specification.

 When a Revision record has no revision properties, svnadmin and svnrdump
 write an empty properties section whereas svndumpfilter omits the properties
 section. (At least in Subversion 1.0 through 1.8.)

 == Ancient history ==

 Old discussion:

 (This file started as a proposal, preserved here for posterity.)

 A proposal for an svn filesystem dump/restore format.

 === Two problems we want to solve ===

  1.  When we change our node-id schema, we need to migrate all of our
      data (by dumping and restoring).

  2.  Serves as a backup format.  Could be read by other software tools
      someday.


 === Design Goals ===

  A.  Written as two new public functions in svn_fs.h.  To be invoked
      by new 'svnadmin' subcommands.

  B.  Format uses only timeless fs concepts.

      The dump format needs to reference concepts that we *know* are
      general enough to never change.  These concepts must exist
      independently of any internal node-id schema, or any DB storage
      backend.  In other words, we're talking about the basic ideas in
      our original "design spec" from May 2000.

 === Format Semantics ===

 Here are the timeless semantics of our fs design -- the things that
 would be stored in our dump format.

   - A filesystem is an array of trees.
     Each tree is called a "revision" and has unversioned properties attached.

   - A revision has a tree of "nodes" hanging off of it.
     Actually, the nodes in the filesystem form a DAG.  A revision
     always points to an initial node that represents the 'root' of some tree.

   - The majority of a tree's nodes are hard-links (references) to
     nodes that were created in earlier trees.

   - A node contains

         - versioned text
         - versioned properties
         - predecessor history:  "which node am I a variant of?"
         - copy history:  "which node am I a copy of?"

     The history values can be non-existent (meaning the node is
     completely new), or can have a value of {revision, path}.

 === Refinement of proposal #2: ===

 (after discussion with gstein)

 Each node starts with RFC822-style headers at the top.  The final
 header is a 'Content-length:', followed by the content, so record
 boundaries can be inferred.

 The content section has two implicit parts: a property hash, and the
 fulltext.  The division between these two sections is implied by the
 "PROPS-END\n" tag at the end of the prophash.  In the case of a
 directory node or a revision, only the prophash is present.

 //End of document.