| = How to interpret Subversion dumpfiles = |
| |
| Version 1.3, 2020-12-21 |
| |
| == Introduction == |
| |
| The Subversion dumpfile format is a serialized description of the |
| actions required to (re)build a version history from scratch. |
| |
| The goal of this document is that it be sufficient for people writing |
| dumpfile interpreters to emulate the actions the dumpfile describes on |
| a versioned filesystem-like store, such as another version-control |
| system. It derives from and incorporates some incomplete notes from |
| before r39883. |
| |
| === Unresolved questions === |
| |
| 1. In interpreting a Node record which has both a copyfrom source and |
| a property section, it is possible that the copy source node itself |
| has a property section. How are they to be combined? |
| |
| 2. The section on the semantics of kinds of operations documents a |
| minor bug at r39883 in the behavior of "add". Has this been fixed? |
| |
| 3. The diff format and compression (if any) used in Version 3 |
| format are not described. |
| |
| Portions of text relevant to these questions are tagged with FIXME. |
| |
| == Syntax == |
| |
| === Encoding and delimiters === |
| |
| Subversion dumpfiles are plain byte streams. The structural parts are |
| ASCII. Text sections and property key/value pairs may be interpreted |
| as binary data in any encoding by client tools. |
| |
| A dumpfile consists of four kinds of records. A record is a group of |
| RFC822-style header lines (each consisting of a key, followed by a |
| colon, followed by text data to end of line), followed by an empty |
| spacer line, followed optionally by a body section. If the body |
| section is present, another empty spacer line separates it from the |
| following record. |
| |
| For forward compatibility, unrecognized headers are ignored. |
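| |
| The record syntax above can be sketched in Python as follows. This is |
| a minimal illustration, not Subversion's own parser: the function |
| name read_headers is hypothetical, the input is assumed to be held in |
| memory as bytes, and header values are assumed to decode as UTF-8. |
| |
```python
def read_headers(data: bytes, pos: int = 0):
    """Parse RFC822-style header lines starting at offset pos.

    Returns (headers, next_pos); next_pos points just past the empty
    spacer line that ends the header block.
    """
    headers = {}
    while True:
        end = data.index(b"\n", pos)      # raises ValueError if truncated
        line = data[pos:end]
        pos = end + 1
        if not line:                      # empty spacer line: headers done
            return headers, pos
        key, _, value = line.partition(b":")
        # Assumption: keys are ASCII and values decode as UTF-8.
        headers[key.decode("ascii")] = value.strip().decode("utf-8")
```
| |
| A caller looks up only the keys it knows, so unrecognized headers are |
| ignored without any extra logic. |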
| |
| === Record types === |
| |
| Dumpfiles include four record types. Two, the version stamp and UUID |
| record, consist of single header lines. The bulk of a dumpfile |
| consists of Revision and Node records. |
| |
| ==== Version stamp records ==== |
| |
| A version stamp record is always the first line of the file and |
| looks like this: |
| |
| ------------------------------------------------------------------- |
| SVN-fs-dump-format-version: <N>\n |
| ------------------------------------------------------------------- |
| |
| where <N> is replaced by the dump format version. Except where |
| specified, the descriptions in this document apply to all |
| versions of the format. |
| |
| ==== UUID records ==== |
| |
| Versions 2 and later may have a UUID record following the version |
| stamp. It is of the form |
| |
| ------------------------------------------------------------------- |
| UUID: <hex-string> |
| ------------------------------------------------------------------- |
| |
| where the <hex-string> is the UUID of the originating repository. |
| An example UUID is "7bf7a5ef-cabf-0310-b7d4-93df341afa7e". |
| |
| As generated by Subversion, these UUIDs are "Version 1", incorporating |
| the MAC of the originating machine. The presentation is in RFC4122 |
| form without the "urn:" or "uuid:" prefixes. |
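| |
| As a rough sketch, the two single-line records could be read like |
| this. The function name parse_preamble is hypothetical, and the input |
| is assumed to be the leading lines of the dump already split and |
| decoded as text. Python's uuid module is used only to validate the |
| textual form of the UUID. |
| |
```python
import uuid

def parse_preamble(lines):
    """Return (format_version, repository_uuid or None)."""
    key, _, value = lines[0].partition(":")
    if key != "SVN-fs-dump-format-version":
        raise ValueError("missing version stamp")
    version = int(value)
    repo_uuid = None
    for line in lines[1:]:
        if not line.strip():
            continue                      # skip empty spacer lines
        key, _, value = line.partition(":")
        if key == "UUID":
            repo_uuid = uuid.UUID(value.strip())   # rejects malformed text
        break                             # only the first record is examined
    return version, repo_uuid
```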
| |
| ==== Revision records ==== |
| |
| A Revision record has up to three headers and is usually followed by a |
| property section. Expect the following form and sequence: |
| |
| ------------------------------------------------------------------- |
| Revision-number: <N> |
| [Prop-content-length: <P>] |
| Content-length: <L> |
| ! |
| ------------------------------------------------------------------- |
| |
| with the Revision-number header always first and the '!' indicating |
| a mandatory empty spacer line. <P> gives the length in bytes of the |
| following property section. <L> gives the body length of the entire |
| Revision record. These two numbers will be *identical* for a Revision |
| record; the Content-length header is added for the benefit of software |
| that can parse RFC-822 messages. |
| |
| A revision record is followed by zero or more Node records (see below). |
| |
| Revisions are always dumped in monotonically increasing |
| revision-number order. The date property of a revision may be |
| explicitly set to any value, or even unset, but normally a sequence of |
| commits will have monotonically increasing timestamps. |
| |
| ==== Node records ==== |
| |
| Each Revision record is followed by zero or more Node records. |
| Node records have the following sequence of header lines: |
| |
| ------------------------------------------------------------------- |
| Node-path: <path/to/node/in/filesystem> |
| [Node-kind: {file | dir}] |
| Node-action: {change | add | delete | replace} |
| [Node-copyfrom-rev: <rev>] |
| [Node-copyfrom-path: <path> ] |
| [Text-copy-source-md5: <blob>] |
| [Text-copy-source-sha1: <blob>] |
| [Text-content-md5: <blob>] |
| [Text-content-sha1: <blob>] |
| [Text-content-length: <T>] |
| [Prop-content-length: <P>] |
| [Content-length: <Y>] |
| ! |
| ------------------------------------------------------------------- |
| |
| Bracketing in [] indicates optional lines; { | } is an alternation group. |
| |
| Dump decoders should be prepared for the optional lines after |
| Node-action to be in any order, except that Content-length is |
| always last if it is present. |
| |
| A Node record describes an action on a path relative to the repository |
| root, and always begins with the Node-path specification. |
| |
| The Node-kind line indicates whether the path is a file or directory. |
| The header value will be one of the strings "file" or "dir". |
| This header may be (and usually is) absent if the node action is a delete. |
| |
| The Node-action line is always present and specifies the type of |
| operation for this node. The header value is one of the strings |
| "change", "add", "delete", or "replace". These operations will be |
| described in detail later in this document. |
| |
| Either both the Node-copyfrom-rev and Node-copyfrom-path lines will be |
| present, or neither will be. They pair to describe a copy source for |
| the node. Copy-source semantics will be described in detail later in |
| this document. |
| |
| The Text-content-{md5,sha1} and Text-copy-source-{md5,sha1} lines are |
| hash integrity checks and will be present only if Text-content-length |
| and the copyfrom pair (respectively) are also present. A decoder may |
| use them to verify that the source content they refer to has not been |
| corrupted. |
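| |
| A decoder's integrity check might look like the sketch below. It |
| assumes each <blob> is the hexadecimal digest of the text section and |
| that headers have already been parsed into a dict; the function name |
| text_matches is hypothetical. |
| |
```python
import hashlib

def text_matches(text: bytes, headers: dict) -> bool:
    """Check text against whichever Text-content-* hashes are present."""
    checks = (("Text-content-md5", "md5"), ("Text-content-sha1", "sha1"))
    for header, algorithm in checks:
        expected = headers.get(header)
        if expected is None:
            continue                      # header absent: nothing to verify
        # Assumption: the header value is a hexadecimal digest.
        actual = hashlib.new(algorithm, text).hexdigest()
        if actual != expected.lower():
            return False
    return True
```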
| |
| Text-content-length will be present only when there is a text section. |
| Zero is a legal value for this length, indicating an empty file. |
| |
| Prop-content-length will be present only when there is a properties section. |
| A properties section has non-zero length even if it has no entries. |
| |
| Content-length will be present if there is either a text or a |
| properties section. This is not always the case. In particular, |
| a delete operation cannot have either. Some other operations that use |
| copyfrom sources may also not have either. |
| |
| Again, the '!' stands in for a mandatory empty line following the |
| RFC822-style headers. A body may follow. |
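| |
| The header rules above can be collected into a small consistency |
| check. This is a sketch under stated assumptions: check_node_headers |
| is a hypothetical name, and the rule that Content-length equals the |
| sum of the text and property lengths is inferred from the bar/baz/bop |
| record in the example section (76 + 54 = 130), not stated as a rule. |
| |
```python
ACTIONS = {"change", "add", "delete", "replace"}

def check_node_headers(h: dict) -> list:
    """Return a list of problems found in one Node record's headers."""
    problems = []
    if "Node-path" not in h:
        problems.append("missing Node-path")
    if h.get("Node-action") not in ACTIONS:
        problems.append("missing or unknown Node-action")
    if ("Node-copyfrom-rev" in h) != ("Node-copyfrom-path" in h):
        problems.append("copyfrom headers must appear as a pair")
    has_body = "Text-content-length" in h or "Prop-content-length" in h
    if has_body != ("Content-length" in h):
        problems.append("Content-length must accompany a body section")
    elif has_body:
        # Assumption: Content-length is the sum of the section lengths.
        total = (int(h.get("Text-content-length", 0))
                 + int(h.get("Prop-content-length", 0)))
        if int(h["Content-length"]) != total:
            problems.append("Content-length does not match section lengths")
    if h.get("Node-action") == "delete" and has_body:
        problems.append("delete cannot carry a text or property section")
    return problems
```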
| |
| === Property sections === |
| |
| A Revision record *may* have a property section, and a Node record *may* |
| have a property section. Every record with a property section has |
| a Prop-content-length header. |
| |
| A property section consists of pairs of key and value records and |
| is ended by a fixed trailer. Here is an example attached to a |
| Revision record: |
| |
| ------------------------------------------------------------------- |
| Revision-number: 1422 |
| Prop-content-length: 80 |
| Content-length: 80 |
| |
| K 6 |
| author |
| V 7 |
| sussman |
| K 3 |
| log |
| V 33 |
| Added two files, changed a third. |
| PROPS-END |
| ------------------------------------------------------------------- |
| |
| The fixed trailer is "PROPS-END\n" and its length is included in the |
| Prop-content-length. Before it, each K and V record consists of a |
| header line giving the length of the key or value content in bytes. |
| The content follows. The content is itself always followed by \n. |
| |
| In version 3 of the format, a third type 'D' of property record is |
| introduced to describe property deletion. This feature will be |
| described later, in the specification of delta dumps. |
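| |
| The layout just described can be decoded with a short loop. The |
| following is an illustrative sketch; parse_props is a hypothetical |
| name, and keys and values are kept as bytes because their encoding is |
| left to client tools. It also collects version 3 'D' records. |
| |
```python
def parse_props(section: bytes):
    """Parse a property section into (props, deleted_names)."""
    props, deleted = {}, []
    key = None
    pos = 0
    while True:
        end = section.index(b"\n", pos)
        line = section[pos:end]
        pos = end + 1
        if line == b"PROPS-END":          # fixed trailer ends the section
            return props, deleted
        kind, length = line.split(b" ")
        length = int(length)
        content = section[pos:pos + length]
        pos += length + 1                 # skip content and its trailing \n
        if kind == b"K":
            key = content
        elif kind == b"V":
            props[key] = content
        elif kind == b"D":
            deleted.append(content)
        else:
            raise ValueError("unknown property record type")
```
| |
| Because content is sliced by its declared length, values containing |
| newlines are handled without special treatment. |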
| |
| == Semantics == |
| |
| === The kinds of things === |
| |
| There are four kinds of things described by a dumpfile: paths, |
| properties, content, and flows. The distinctions among content, |
| paths, and flows matter for understanding some operations. |
| |
| A path is a filesystem location (a file or directory). There are two |
| kinds of paths in a dumpfile: node paths and copy sources. |
| |
| Properties are key-value pairs associated with revisions or paths. |
| Subversion interprets and reserves some properties, those beginning |
| with "svn:". Others are not interpreted by Subversion; they may |
| be set and read for the convenience of other applications, such |
| as repository browsers or translators. |
| |
| A flow is a sequence of actions on a file or directory path that is |
| considered to be a single history for change-tracking purposes. |
| Creating a flow tells Subversion that you want to track the history of |
| the path or paths it contains. Destroying a flow breaks the chain of |
| history; changes will not be tracked across the break, even if another |
| flow is created at the same path. A copy operation creates a new |
| flow connected to the flow from which it was copied. |
| |
| Content is what file paths point at (one timewise slice of a flow). It |
| is the payload of program source code, documents, images, and so forth |
| that a version control system actually manages. |
| |
| A Node record describes a change in properties, the addition or deletion |
| of a flow, or a change in content. It must do at least one of these things, |
| otherwise it would be a no-op and omitted. |
| |
| When no copyfrom is present, and the action isn't an add or copy, then |
| the kind of the thing identified by (PATH, REVISION) must agree with |
| the kind of the thing identified by (PATH, REVISION-1). |
| |
| Terminological note: in Subversion-speak, the term "node" is |
| historically ambiguous. Sometimes it refers to what this document |
| calls a "flow", and sometimes it refers to the internal per-revision |
| structure that a Node record represents (that is, just one action in a |
| flow). For clarity, most of this document avoids the term "node" in |
| favor of the more specific "flow" and "Node record", but knowing |
| about this issue will help if you read the Ancient History section. |
| |
| === The kinds of operations === |
| |
| .File operations |
| |====================================================================== |
| | | add | delete | replace | change | |
| |Can have text section? | optional | no | optional | optional | |
| |Can have property section? | optional | no | optional | optional | |
| |Can have copy source? | optional | no | optional | no | |
| |Fails on existent path | yes* | no | no | no | |
| |Fails on non-existent path | no | yes | yes | yes | |
| |====================================================================== |
| |
| FIXME: As of December 2011 there is a minor bug: Adding a file with history |
| twice _in two different revisions_ succeeds silently. |
| |
| .Directory operations |
| |====================================================================== |
| | | add | delete | replace | change | |
| |Can have text section? | no | no | no | no | |
| |Can have property section? | optional | no | optional | required | |
| |Can have copy source? | optional | no | optional | no | |
| |Fails on existent path | yes | no | no | no | |
| |Fails on non-existent path | no | yes | yes | yes | |
| |====================================================================== |
| |
| A Node record represents an operation that does one of four things: add, |
| delete, change, or replace. |
| |
| Node records can carry content in one (or both!) of two ways: from a text |
| section or from a copy source (that is, a copy-path and copy-revision |
| pair). |
| |
| Giving a copy source appends the node to the flow of which that source |
| is part; when you 'add' or 'replace' with a copy source, the content |
| at the path becomes a copy of the source (but see below for a |
| qualification about directories). |
| |
| Giving a text section also changes the content of the flow. In the |
| (unusual) case that a node has both a copy source and a text section, |
| the correct semantics is to attach the path to the source flow and |
| then change the content. |
| |
| An add operation creates a new flow for a file or directory. See the |
| table above for possible operand combinations. |
| |
| A delete operation deletes a flow and its content. If the path is a |
| file, the file is deleted. If the path is a directory, the directory |
| and all its children are deleted. A subsequent add at the same path |
| will create a new and different flow with its own history. |
| |
| A change operation changes the properties of a file or directory path, |
| the content of a file path, or both. See the table above for possible |
| operand combinations. |
| |
| A replace operation behaves exactly like a delete followed by an add |
| (destroying an old flow, producing a new one) when it has no copy |
| source. When a replace has a copy source, it produces a new flow |
| with history extending back through the copy source. A Node record |
| representing a replace operation may have a property section. |
| |
| The main reason "replace" exists is that it helps sequential |
| processors of the dump stream avoid possibly notifying about multiple |
| actions on the same path. |
| |
| It is even possible to have a replace with a copyfrom source *and* |
| text, such as would result from this on the client side: |
| |
| ------------------------------------------------------------------- |
| $ svn rm dir/file.txt |
| $ svn cp otherdir/otherfile.txt dir/file.txt |
| $ echo "Replacement text" > dir/file.txt |
| $ svn ci -m "Replace dir/file.txt with a copy of otherdir/otherfile.txt and replace its text, too." |
| ------------------------------------------------------------------- |
| |
| Subversion filesystems do not allow the root directory ("/") to be |
| deleted or replaced. |
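| |
| The existence rules from the tables above can be sketched as a |
| function over the set of paths present in the current revision. This |
| is a simplification with a hypothetical name (apply_action): it does |
| not model copy sources, kinds, or the root-directory restriction, and |
| it does not reproduce the FIXME noted under the file table. |
| |
```python
def apply_action(paths: set, action: str, path: str) -> None:
    """Update the set of existing paths for one Node record, in place."""
    exists = path in paths
    if action == "add":
        if exists:
            raise ValueError("add on existing path: " + path)
    elif action in ("change", "delete", "replace"):
        if not exists:
            raise ValueError(action + " on missing path: " + path)
    else:
        raise ValueError("unknown action: " + action)
    if action in ("delete", "replace"):
        # Deleting a directory also deletes everything beneath it.
        doomed = {p for p in paths if p == path or p.startswith(path + "/")}
        paths.difference_update(doomed)
    if action in ("add", "replace"):
        paths.add(path)
```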
| |
| === Some details about copyfroms === |
| |
| The source and target of a copyfrom are always of like kind; that is, |
| Subversion dump will never generate a node with a source type of file |
| and a target type of directory or vice-versa. |
| |
| Interpreting copyfrom_path for file copies is straightforward; the |
| target pathname gets the contents of the source pathname. |
| |
| Directory copies (the primitive beneath branching and tagging) are |
| tricky. For each source path under the source directory, a new path |
| is generated by removing the head segment of the pathname that is |
| the source directory. That new path under the target directory gets |
| the content of the source path. |
| |
| After this operation: |
| |
| ------------------------------------------------------------------- |
| Node-path: x/y/z |
| Node-kind: dir |
| Node-action: add |
| Node-copyfrom-rev: 10 |
| Node-copyfrom-path: a/b/c |
| ------------------------------------------------------------------- |
| |
| the file a/b/c/d will have been copied to x/y/z/d. |
| |
| A single revision may include multiple copyfrom Node records, even multiple |
| copyfroms to the same directory, even mixed directory and file copies |
| to the same directory. |
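| |
| The path rewriting for a directory copy can be sketched as below. The |
| function name copy_directory is hypothetical, and source_paths is |
| assumed to be the collection of paths that existed in the copyfrom |
| revision. |
| |
```python
def copy_directory(source_paths, copyfrom_path: str, node_path: str) -> dict:
    """Map each path under copyfrom_path to its location under node_path."""
    prefix = copyfrom_path + "/"
    mapping = {copyfrom_path: node_path}  # the directory itself
    for path in source_paths:
        if path.startswith(prefix):
            # Drop the source directory, keep the remainder of the path.
            mapping[path] = node_path + "/" + path[len(prefix):]
    return mapping
```
| |
| Each target path in the result then receives the content that its |
| source path had in the copyfrom revision. |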
| |
| === Properties and persistence === |
| |
| The properties section of a Revision record consists of some (possibly |
| empty) subset of the three reserved revision properties: svn:author, |
| svn:date, and svn:log, along with any other revision properties. |
| |
| The revision properties do not persist to later revisions. Each revision |
| has exactly the revision properties specified in its revision record, or |
| no revision properties if there is no property section. |
| |
| Node properties, like node text, are persistent: once set, they are |
| carried forward until modified by a future property section, both on the |
| same path and on the target of a copyfrom operation. |
| |
| In non-delta format, to delete a given property from a path, a dumpfile |
| generator will issue a Node record with all other properties listed in it; |
| to delete all properties from a path, the dumpfile generator will |
| issue a node with an empty properties section. (This is different from |
| an *absent* properties section, which will change no properties.) |
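| |
| For non-delta dumps, the difference between an absent and an empty |
| property section can be sketched as follows. The function name |
| next_node_props is hypothetical; "previous" stands for the properties |
| carried forward on the path. The case of a record with both a copy |
| source and a property section is listed under Unresolved questions |
| and is not addressed here. |
| |
```python
def next_node_props(previous: dict, section):
    """Return the node's properties after one non-delta Node record.

    section is None when the record has no property section, otherwise
    a dict (possibly empty) of the properties it lists.
    """
    if section is None:
        return dict(previous)     # absent section: nothing changes
    return dict(section)          # present section: replaces the whole set
```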
| |
| === Special properties === |
| |
| svn:author:: |
| Identification of the commit author, normally a Unix username on |
| the repository server. |
| |
| svn:date:: |
| An RFC3339 UTC timestamp to microsecond precision. Here is an example: |
| "2011-11-30T16:41:55.154754Z". The length and precision of this |
| field are fixed: 27 characters with the trailing Zulu timezone |
| required. |
| |
| svn:log:: |
| The change comment associated with this revision. |
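| |
| A sketch of parsing the svn:date form shown above, with six |
| fractional digits and a trailing "Z". The function name |
| parse_svn_date is hypothetical. |
| |
```python
from datetime import datetime, timezone

def parse_svn_date(value: str) -> datetime:
    """Parse a fixed-width svn:date value into an aware UTC datetime."""
    if len(value) != 27:
        raise ValueError("svn:date must be exactly 27 characters")
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)
```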
| |
| The following per-node properties are interpreted by Subversion itself: |
| |
| svn:executable:: |
| When this is set on a node, the executable bits are set in the |
| metadata of checked-out copies of the corresponding file. On Unix |
| systems, which of the x bits are set is filtered by the current umask. |
| |
| svn:ignore, svn:global-ignores:: |
| Set ignore patterns for Subversion clients. Read the Subversion |
| manual for semantic details. |
| |
| svn:mergeinfo, svn:mime-type, svn:keywords, svn:eol-style, svn:externals:: |
| Read the Subversion manual for semantic details. |
| |
| All other property settings are ignored by Subversion, but preserved |
| in repositories and dump streams. |
| |
| === Representation of symbolic links === |
| |
| When the Subversion client sends a content blob representing a |
| symbolic link (that is, with the svn:special property), the content of |
| the blob is not just the link's target path. It will have the prefix |
| "link ". The client likewise interprets this prefix at checkout time. |
| |
| In the future, other special blob formats with other prefix keywords may |
| be defined. None such yet exist as of revision 1441992 (February 2013). |
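| |
| An interpreter might recover the link target along these lines. This |
| is a sketch with a hypothetical function name (symlink_target); it |
| assumes the node's properties are available as a dict keyed by |
| property name. |
| |
```python
LINK_PREFIX = b"link "

def symlink_target(content: bytes, props: dict):
    """Return the link target for a symlink blob, or None otherwise."""
    if "svn:special" in props and content.startswith(LINK_PREFIX):
        return content[len(LINK_PREFIX):]
    return None
```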
| |
| === Implementation pragmatics === |
| |
| Dumpfile generation in general is not canonical, nor minimalist. In |
| particular, interpreters should be prepared to see many empty property |
| sections (nominally deleting all properties) when there are no |
| previous properties set to be deleted. |
| |
| Because directory operations with copyfroms don't specify all the file |
| paths they modify, an interpreter for this format must build a map of |
| the paths in the file store it is manipulating, and update that map as |
| it processes each Node record. |
| |
| On a repository with thousands of commits, the per-revision list of |
| maps can become quite large. It is tempting to think that the file map |
| for each revision can be discarded after it is processed unless it is |
| a source revision for a copyfrom, but there are cases in which doing |
| this will leave you unable to trace ancestry chains through copies. |
| |
| Instead, it is advisable to build your filemaps using a copy-on-write |
| store. |
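| |
| One possible shape for such a store is a persistent tree with |
| structural sharing, sketched below. This is an illustration rather |
| than a recommended implementation: the names tree_get and tree_set |
| are hypothetical, directories are plain dicts, and file entries are |
| assumed to be any non-dict value. |
| |
```python
def tree_get(root: dict, path: str):
    """Return the entry at path, or None if it does not exist."""
    node = root
    for part in path.split("/"):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def tree_set(root: dict, path: str, value) -> dict:
    """Return a new tree with path set to value; root is left untouched."""
    head, _, rest = path.partition("/")
    new_root = dict(root)                 # shallow copy: siblings are shared
    if rest:
        child = root.get(head)
        if not isinstance(child, dict):
            child = {}                    # create missing parent directories
        new_root[head] = tree_set(child, rest, value)
    else:
        new_root[head] = value
    return new_root
```
| |
| Each revision keeps its own root. A directory copy then becomes a |
| single tree_set call that reuses the source subtree, and only the |
| directories along a modified path are duplicated. |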
| |
| == An example == |
| |
| Here's an example of revision 1422, which added a new directory |
| "baz", added a new file "bop" inside it, and modified the file "foo.c": |
| |
| ------------------------------------------------------------------- |
| Revision-number: 1422 |
| Prop-content-length: 80 |
| Content-length: 80 |
| |
| K 6 |
| author |
| V 7 |
| sussman |
| K 3 |
| log |
| V 33 |
| Added two files, changed a third. |
| PROPS-END |
| |
| Node-path: bar/baz |
| Node-kind: dir |
| Node-action: add |
| Prop-content-length: 35 |
| Content-length: 35 |
| |
| K 10 |
| svn:ignore |
| V 4 |
| TAGS |
| PROPS-END |
| |
| |
| Node-path: bar/baz/bop |
| Node-kind: file |
| Node-action: add |
| Prop-content-length: 76 |
| Text-content-length: 54 |
| Content-length: 130 |
| |
| K 14 |
| svn:executable |
| V 2 |
| on |
| K 12 |
| svn:keywords |
| V 15 |
| LastChangedDate |
| PROPS-END |
| Here is the text of the newly added 'bop' file. |
| Whee. |
| |
| Node-path: bar/foo.c |
| Node-kind: file |
| Node-action: change |
| Text-content-length: 102 |
| Content-length: 102 |
| |
| Here is the fulltext of my change to an existing /bar/foo.c. |
| Notice that this file has no properties. |
| ------------------------------------------------------------------- |
| |
| == Format variants == |
| |
| === Version 3 format === |
| |
| Version 3 format is a delta dump; text changes are represented |
| as diffs against the original file, and properties as incremental |
| changes to a persistent set (that is, a property section does not |
| necessarily implicitly clear the property set on a path before the |
| new property settings are evaluated). |
| |
| This change is a space optimization. It requires additional |
| computing time to integrate the diff history. |
| |
| Version 3 is generated by SVN versions 1.1.0-present, if requested by the user. |
| |
| This format is equivalent to the version 2 format except for the |
| following: |
| |
| 1. The format starts with the new version number of the dump format |
| ("SVN-fs-dump-format-version: 3\n"). |
| |
| 2. There are several new optional headers for Node records: |
| |
| ------------------------------------------------------------------- |
| [Text-delta: true|false] |
| [Prop-delta: true|false] |
| [Text-delta-base-md5: blob] |
| [Text-delta-base-sha1: blob] |
| [Text-copy-source-sha1: blob] |
| [Text-content-sha1: blob] |
| ------------------------------------------------------------------- |
| |
| The default value for the boolean headers is "false". If the value is |
| set to "true", then the text and property contents will be treated |
| as deltas against the previous contents of the flow (as determined |
| by copy history for adds with history, or by the value in the |
| previous revision for changes--just as with commits). |
| |
| Property deltas have the same format as regular property lists except |
| that (1) properties with the same value as in the previous contents of |
| the flow are not printed, and (2) deleted properties will be written |
| out as |
| |
| ------------------------------------------------------------------- |
| D <name length> |
| <name> |
| ------------------------------------------------------------------- |
| |
| just as a regular property is printed, but with the "K " changed to a |
| "D " and with no value part. |
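| |
| Applying a property delta can be sketched as follows, assuming the |
| section has already been split into the properties it sets and the |
| names it deletes. The function name apply_prop_delta is hypothetical. |
| |
```python
def apply_prop_delta(previous: dict, changed: dict, deleted: list) -> dict:
    """Return the properties after applying one Prop-delta section."""
    result = dict(previous)       # unlisted properties carry forward
    result.update(changed)        # K/V records add or overwrite
    for name in deleted:          # D records remove
        result.pop(name, None)
    return result
```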
| |
| Text deltas are written out as a series of svndiff0 windows. If |
| Text-delta-base-md5 is provided, it is the checksum of the base to |
| which the text delta is applied; note that older versions (pre-1.5) of |
| 'svnadmin load' may ignore the checksum. |
| |
| Text-delta-base-sha1, Text-copy-source-sha1, and Text-content-sha1 are not |
| currently used by the loader. They are written by 1.6-and-later versions of |
| Subversion so that future loaders can optionally choose which checksum to |
| use for checking for corruption. |
| |
| === Archaic version 1 format === |
| |
| There are actually two types of version 1 dump streams. The regular ones |
| have been generated since r2634 (svn 0.14.0). Older ones also claim to be |
| version 1, but lack the Prop-content-length and Text-content-length |
| fields in the block header. In those days there *always* was a |
| properties block. |
| |
| This note is included for historical completeness only, as it is highly |
| unlikely that any Subversion instances that old remain in production. |
| |
| == Implementation choices for optional behaviour == |
| |
| This section lists some of the ways existing implementations interpret the |
| optional aspects of the specification. |
| |
| When a Revision record has no revision properties, svnadmin and svnrdump |
| write an empty properties section whereas svndumpfilter omits the properties |
| section. (At least in Subversion 1.0 through 1.8.) |
| |
| == Ancient history == |
| |
| Old discussion: |
| |
| (This file started as a proposal, preserved here for posterity.) |
| |
| A proposal for an svn filesystem dump/restore format. |
| |
| === Two problems we want to solve === |
| |
| 1. When we change our node-id schema, we need to migrate all of our |
| data (by dumping and restoring). |
| |
| 2. Serves as a backup format. Could be read by other software tools |
| someday. |
| |
| |
| === Design Goals === |
| |
| A. Written as two new public functions in svn_fs.h. To be invoked |
| by new 'svnadmin' subcommands. |
| |
| B. Format uses only timeless fs concepts. |
| |
| The dump format needs to reference concepts that we *know* are |
| general enough to never change. These concepts must exist |
| independently of any internal node-id schema, or any DB storage |
| backend. In other words, we're talking about the basic ideas in |
| our original "design spec" from May 2000. |
| |
| === Format Semantics === |
| |
| Here are the timeless semantics of our fs design -- the things that |
| would be stored in our dump format. |
| |
| - A filesystem is an array of trees. |
| Each tree is called a "revision" and has unversioned properties attached. |
| |
| - A revision has a tree of "nodes" hanging off of it. |
| Actually, the nodes in the filesystem form a DAG. A revision |
| always points to an initial node that represents the 'root' of some tree. |
| |
| - The majority of a tree's nodes are hard-links (references) to |
| nodes that were created in earlier trees. |
| |
| - A node contains |
| |
| - versioned text |
| - versioned properties |
| - predecessor history: "which node am I a variant of?" |
| - copy history: "which node am I a copy of?" |
| |
| The history values can be non-existent (meaning the node is |
| completely new), or can have a value of {revision, path}. |
| |
| === Refinement of proposal #2: === |
| |
| (after discussion with gstein) |
| |
| Each node starts with RFC822-style headers at the top. The final |
| header is a 'Content-length:', followed by the content, so record |
| boundaries can be inferred. |
| |
| The content section has two implicit parts: a property hash, and the |
| fulltext. The division between these two sections is implied by the |
| "PROPS-END\n" tag at the end of the prophash. In the case of a |
| directory node or a revision, only the prophash is present. |
| |
| //End of document. |