                        How cvs2svn.py Works
                       =====================

 A cvs2svn run consists of 5 passes.  Every pass but the last saves its
 data to a file on disk, so that a) we don't hold huge amounts of state
 in memory, and b) the conversion process is resumable.  The final pass
 makes the actual Subversion commits.

 Pass 1:
 =======

 The goal of this pass is to get a summary of all the revisions for
 each file written out to 'cvs2svn-data.revs'; at the end of this
 stage, revisions will be grouped by RCS file, not by logical commits.

 We walk over the repository, processing each RCS file with
 rcsparse.parse(), using cvs2svn's CollectData class, which is a
 subclass of rcsparse.Sink(), the parser's callback class.  For each
 RCS file, the first thing the parser encounters is the administrative
 header, including the head revision, the principal branch, symbolic
 names, RCS comments, etc.  The main thing that happens here is that
 CollectData.define_tag() is invoked on each symbolic name and its
 attached revision, so all the tags and branches of this file get
 collected.

 Next, the parser hits the revision summary section.  That's the part
 of the RCS file that looks like this:

    1.6
    date	2002.06.12.04.54.12;	author captnmark;	state Exp;
    branches
    	1.6.2.1;
    next	1.5;

    1.5
    date	2002.05.28.18.02.11;	author captnmark;	state Exp;
    branches;
    next	1.4;

    [...]

 For each revision summary, CollectData.define_revision() is invoked,
 recording that revision's metadata in the self.rev_data[] tree.

 After finishing the revision summaries, the parser invokes
 CollectData.tree_completed(), which loops over the revisions in
 self.rev_data, determining if there are instances where a higher
 revision was committed "before" a lower one (rare, but it can happen
 when there was clock skew on the repository machine).  If there are
 any, it "resyncs" the timestamp of the higher rev to be just after
 that of the lower rev, but saves the original timestamp in
 self.rev_data[blah][3], so we can later write out a record to the
 resync file indicating that an adjustment was made (this makes it
 possible to catch the other parts of this commit and resync them
 similarly, more details below).
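
 The resync step can be sketched roughly as follows.  This is a
 minimal illustration, not the code in cvs2svn.py: it only handles a
 single line of revisions ordered from lowest to highest, and the
 function name, the tuple layout, and the one-second gap are all
 assumptions made for the example.

```python
def resync_timestamps(revs, gap=1):
    """Return (adjusted, resyncs) for one RCS file.

    revs -- list of (revision, timestamp) pairs, lowest revision first
    gap  -- seconds to place a resynced revision after its predecessor
            (assumed value; the notes only say "just after")
    """
    adjusted = []
    resyncs = []  # (revision, new_timestamp, original_timestamp)
    prev_time = None
    for rev, timestamp in revs:
        new_time = timestamp
        # A higher revision that claims to be older than the one before
        # it is treated as clock skew and moved to just after it.
        if prev_time is not None and timestamp < prev_time:
            new_time = prev_time + gap
            resyncs.append((rev, new_time, timestamp))
        adjusted.append((rev, new_time))
        prev_time = new_time
    return adjusted, resyncs
```

 Each entry in the returned resyncs list carries the new and the
 original timestamp, which is the information a resync record needs.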

 Next, the parser encounters the *real* revision data, which has the
 log messages and file contents.  For each revision, it invokes
 CollectData.set_revision_info(), which writes a new line to
 cvs2svn-data.revs, like this:

    3dc32955 5afe9b4ba41843d8eb52ae7db47a43eaa9573254 C 1.2 * 0 0 foo/bar,v

 The fields are:

    1. a fixed-width timestamp
    2. a digest of the log message + author
    3. the type of change ("C"hange, or "D"elete)
    4. the revision number
    5. the branch on which this commit happened, or "*" if not on a branch
    6. the number of tags rooted at this revision (followed by their
         names, space-delimited)
    7. the number of branches rooted at this revision (followed by
         their names, space-delimited)
    8. the path of the RCS file in the repository

 (Of course, in the above example, fields 6 and 7 are "0", so they have
 no additional data.)
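
 To make the field layout concrete, here is a small parser for one
 such line.  It is a sketch only: the RevRecord type and the
 parse_revs_line() name are made up for this example, the timestamp is
 assumed to be hexadecimal seconds (which is what the sample looks
 like), and the path is assumed to be everything after the branch
 names.

```python
from collections import namedtuple

# Hypothetical record type, one per line of cvs2svn-data.revs.
RevRecord = namedtuple(
    "RevRecord", "timestamp digest op rev branch tags branches path")


def parse_revs_line(line):
    fields = line.rstrip("\n").split(" ")
    timestamp = int(fields[0], 16)      # assumed: hex seconds
    digest, op, rev, branch = fields[1:5]
    pos = 5
    # Field 6: a count, followed by that many tag names.
    ntags = int(fields[pos])
    tags = fields[pos + 1:pos + 1 + ntags]
    pos += 1 + ntags
    # Field 7: a count, followed by that many branch names.
    nbranches = int(fields[pos])
    branches = fields[pos + 1:pos + 1 + nbranches]
    pos += 1 + nbranches
    # Field 8: the rest of the line (the path may contain spaces).
    path = " ".join(fields[pos:])
    if branch == "*":
        branch = None                   # "*" means "not on a branch"
    return RevRecord(timestamp, digest, op, rev, branch, tags, branches, path)
```

 Because the tag and branch names are preceded by their counts, the
 line can be split on spaces without any quoting.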

 Also, for resync'd revisions, a line like this is written out to
 'cvs2svn-data.resync':

    3d6c1329 18a215a05abea1c6c155dcc7283b88ae7ce23502 3d6c1328

 The fields are:

    NEW_TIMESTAMP   DIGEST   OLD_TIMESTAMP

 (The resync file will be explained later.)

 That's it -- the RCS file is done.

 When every RCS file is done, Pass 1 is complete, and:

    - cvs2svn-data.revs contains a summary of every RCS file's
      revisions.  All the revisions for a given RCS file are grouped
      together, but note that the groups are in no particular order.
      In other words, you can't yet identify the commits from looking
      at these lines; a multi-file commit will be scattered all over
      the place.

    - cvs2svn-data.resync contains a small amount of resync data, in
      no particular order.

 Pass 2:
 =======

 This is where the resync file is used.  The goal of this pass is to
 convert cvs2svn-data.revs to a new file, 'cvs2svn-data.c-revs' (clean
 revs).  It's the same as the original file, except for some resync'd
 timestamps.

 First, read the whole resync file into a hash table that maps each
 author+log digest to a list of lists.  Each sublist represents one of
 the timestamp adjustments from Pass 1, and looks like this:

    [old_time_lower, old_time_upper, new_time]

 The reason to map each digest to a list of sublists, instead of to one
 list, is that sometimes you'll get the same digest for unrelated
 commits (for example, the same author commits many times using the
 empty log message, or a log message that just says "Doc tweaks.").  So
 each digest may need to "fan out" to cover multiple commits, but
 without accidentally unifying those commits.
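
 A sketch of building that hash table from the resync lines follows.
 The names are illustrative, and two things are assumptions rather
 than facts from these notes: the value of COMMIT_THRESHOLD, and the
 choice to centre each window on the old timestamp with half the
 threshold on either side.

```python
COMMIT_THRESHOLD = 5 * 60   # seconds; assumed value for illustration


def read_resync(lines, threshold=COMMIT_THRESHOLD):
    """Map each digest to a list of [old_lower, old_upper, new_time]."""
    resync = {}
    half = threshold // 2
    for line in lines:
        # Field order as written in Pass 1: NEW_TIMESTAMP DIGEST OLD_TIMESTAMP
        new_hex, digest, old_hex = line.split()
        new_time = int(new_hex, 16)
        old_time = int(old_hex, 16)
        # One sublist per adjustment, so unrelated commits that share a
        # digest stay separate.
        resync.setdefault(digest, []).append(
            [old_time - half, old_time + half, new_time])
    return resync
```

 The sublists are plain lists rather than tuples because their bounds
 are adjusted in place later in this pass.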

 Now we loop over cvs2svn-data.revs, writing each line out to
 'cvs2svn-data.c-revs'.  Most lines are written out unchanged, but
 those whose digest matches some resync entry, and which appear to be
 part of the same commit as one of the sublists in that entry, get
 tweaked.
 The tweak is to adjust the commit time of the line to the new_time,
 which is taken from the resync hash and results from the adjustment
 described in Pass 1.

 The way we figure out whether a given line needs to be tweaked is to
 loop over all the sublists, seeing if this commit's original time
 falls within the old_time_lower<-->old_time_upper range of the
 current sublist.  If it
 does, we tweak the line before writing it out, and then conditionally
 adjust the sublist's range to account for the timestamp we just
 adjusted (since it could be an outlier).  Note that this could, in
 theory, result in separate commits being accidentally unified, since
 we might gradually adjust the two sides of the range such that they are
 eventually more than COMMIT_THRESHOLD seconds apart.  However, this is
 really a case of CVS not recording enough information to disambiguate
 the commits; we'd know we have a time range that exceeds the
 COMMIT_THRESHOLD, but we wouldn't necessarily know where to divide it
 up.  We could try some clever heuristic, but for now it's not
 important -- after all, we're talking about commits that weren't
 important enough to have a distinctive log message anyway, so does it
 really matter if a couple of them accidentally get unified?  Probably
 not.
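
 The per-line check might look something like this.  It is a sketch
 under assumptions: the function name is invented, the threshold
 value is a placeholder, and the rule for widening the range (keep
 half a threshold of slack around every timestamp seen so far) is one
 plausible reading of "conditionally adjust", not a statement of what
 cvs2svn.py does.

```python
def resync_time(timestamp, digest, resync, threshold=300):
    """Return the (possibly adjusted) commit time for one revs line.

    resync maps digest -> list of [old_lower, old_upper, new_time].
    """
    half = threshold // 2
    for record in resync.get(digest, ()):
        if record[0] <= timestamp <= record[1]:
            # Looks like part of the same commit.  Widen the window so
            # that later lines close to this one are also caught.
            record[0] = min(record[0], timestamp - half)
            record[1] = max(record[1], timestamp + half)
            return record[2]
    return timestamp   # no matching sublist: written out unchanged
```

 Note how the window only ever grows.  That is exactly the drift
 described above: a chain of nearby timestamps can stretch one sublist
 until it spans more than COMMIT_THRESHOLD seconds.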

 Pass 3:
 =======

 This is where we deduce the changesets, that is, the grouping of file
 changes into single commits.

 It's very simple -- run 'sort' on cvs2svn-data.c-revs, converting it
 to 'cvs2svn-data.s-revs'.  Because of the way the data is laid out,
 this causes commits with the same digest (that is, the same author and
 log message) to be grouped together.  Poof!  We now have the CVS
 changes grouped by logical commit.

 In some cases, the changes in a given commit may be interleaved with
 other commits that went on at the same time, because the sort gives
 precedence to date before log digest.  However, Pass 4 detects this by
 seeing that the log digest is different, and reseparates the commits.
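
 The sort-then-reseparate idea can be illustrated like this.  The
 sketch assumes the timestamp is fixed-width hex, so that a plain
 text sort is also a chronological sort.  The group_commits() name,
 the threshold value, and the rule of closing a commit once no line
 with its digest has been seen for more than the threshold are
 assumptions for the example.

```python
COMMIT_THRESHOLD = 5 * 60   # seconds; assumed value for illustration


def group_commits(lines, threshold=COMMIT_THRESHOLD):
    """Group revs lines into commits; returns a list of lists of lines."""
    open_commits = {}   # digest -> [last_timestamp, [lines]]
    finished = []
    for line in sorted(lines):          # what Pass 3's 'sort' does
        stamp, digest = line.split(" ", 2)[:2]
        timestamp = int(stamp, 16)
        # Close any commit that has been quiet for too long.
        stale = [d for d, (last, _) in open_commits.items()
                 if timestamp - last > threshold]
        for d in stale:
            finished.append(open_commits.pop(d)[1])
        # Interleaved lines are reseparated by digest.
        if digest in open_commits:
            entry = open_commits[digest]
            entry[0] = timestamp
            entry[1].append(line)
        else:
            open_commits[digest] = [timestamp, [line]]
    finished.extend(entry[1] for entry in open_commits.values())
    return finished
```

 Two commits that overlap in time come out as separate groups because
 their digests differ, while a much later commit with a reused log
 message starts a new group.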

 Pass 4:
 =======

 Walk through cvs2svn-data.s-revs and print the commits to a Subversion
 dumpfile (a file intended for 'svnadmin load').  The dumpfile is the
 data's last static stage: last chance to check over the data, run it
 through svndumpfilter, move the dumpfile to another machine, etc.

                   ===============================
                       Branches and Tags Plan.
                   ===============================

 This pass is also where tag and branch creation is done.  Since
 Subversion does tags and branches by copying from existing revisions
 (then maybe editing the copy, making subcopies underneath, etc), the
 big question for cvs2svn is how to achieve the minimum number of
 operations per creation.  For example, if it's possible to get the
 right tag by just copying revision 53, then it's better to do that
 than, say, copying revision 51 and then sub-copying in bits of
 revision 52 and 53.

 Also, since CVS does not version symbolic names, there is the
 secondary question of *when* to create a particular tag or branch.
 For example, a tag might have been made at any time after the youngest
 commit included in it, or might even have been made piecemeal; and the
 same is true for a branch, with the added constraint that for any
 particular file, the branch must have been created before the first
 commit on the branch.

 Answering the second question first: cvs2svn creates tags and branches
 as late as possible.  For branches, this is "just in time" creation --
 the moment it sees the first commit on a branch, it snaps the entire
 branch into existence (or as much of it as possible), and then outputs
 the branch commit.

 The reason we say "as much of it as possible" is that it's possible to
 have a branch where some files have branch commits occurring earlier
 than the other files even have the source revisions from which the
 branch sprouts (this can happen if the branch was created piecemeal,
 for example).  In this case, we create as much of the branch as we
 can, that is, as much of it as there are source revisions available to
 copy, and leave the rest for later.  "Later" might mean just until
 other branch commits come in, or else during a cleanup stage that
 happens at the end of this pass (about which more later).

 All tags are created during the cleanup stage, after all regular
 commits have been made.  That way there's no need to worry whether all
 the required revisions for a particular tag have been committed yet,
 and it's as correct as any other time, since no one can tell when a
 tag was made anyway.

 How just-in-time branch creation works:

 In order to make the "best" set of copies/deletes when creating a
 branch, cvs2svn keeps track of two sets of trees while it's making
 commits:

    1. A skeleton mirror of the Subversion repository, that is, an
       array of revisions, with a tree hanging off each revision.  (The
       "array" is actually implemented as an anydbm database itself,
       mapping string representations of numbers to root keys.)

    2. A tree for each CVS symbolic name, and the svn file/directory
       revisions from which various parts of that tree could be copied.

 Both tree sets live in anydbm databases, using the same basic schema:
 unique keys map to marshal.dumps() representations of dictionaries,
 which in turn map entry names to other unique keys:

    root_key  ==> { entryname1 : entrykey1, entryname2 : entrykey2, ... }
    entrykey1 ==> { entrynameX : entrykeyX, ... }
    entrykey2 ==> { entrynameY : entrykeyY, ... }
    entrykeyX ==> { etc, etc ...}
    entrykeyY ==> { etc, etc ...}

 (The leaf nodes -- files -- are also dictionaries, for simplicity.)
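
 Here is a small sketch of that schema.  To keep it self-contained, a
 plain dict stands in for the anydbm database (in Python 3 the
 equivalent module is 'dbm'); the TreeStore class, its method names,
 and the use of a counter for unique keys are assumptions for the
 example, not the classes in cvs2svn.py.

```python
import marshal


class TreeStore:
    """Tree of directories stored as key -> marshal.dumps(dict)."""

    def __init__(self, db=None):
        self.db = {} if db is None else db   # stand-in for a dbm file
        self.next_key = 0
        self.root_key = self._new_node()

    def _new_node(self):
        # Allocate a fresh unique key holding an empty dictionary.
        key = str(self.next_key)
        self.next_key += 1
        self.db[key] = marshal.dumps({})
        return key

    def add_path(self, path):
        """Create any missing nodes along path; return the leaf's key."""
        key = self.root_key
        for name in path.strip("/").split("/"):
            node = marshal.loads(self.db[key])
            if name not in node:
                node[name] = self._new_node()
                self.db[key] = marshal.dumps(node)   # write the parent back
            key = node[name]
        return key

    def lookup(self, path):
        """Return the key for path, or None if it does not exist."""
        key = self.root_key
        for name in path.strip("/").split("/"):
            node = marshal.loads(self.db[key])
            if name not in node:
                return None
            key = node[name]
        return key
```

 Since every node is stored under its own key, changing one directory
 means rewriting only that directory's dictionary, not the whole tree.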

 Both file and directory dictionaries store metadata under special keys
 whose names start with "/", so they can always be distinguished from
 entries (for example, search for "/mutable", "/openings", or
 "/closings" in cvs2svn.py).

 The repository mirror allows cvs2svn to remember what paths exist in
 what revisions.  For each file path in a revision, it records what
 tags and branches can sprout from that revision; when the file
 changes, these attributes do not propagate to the new revision, since
 the symbolic name isn't based on that revision.

 The symbolic name trees are all stored in one db file, as paths, where
 the first element in each path is the symbolic name, and the rest is
 the full Subversion path to the file in question.  For example, if the
 Subversion revision 7 is the root of branch 'Rel_1', this fact would
 be recorded under the path

    '/Rel_1/myproj/trunk/lib/driver.c'

 (the exact layout is dependent on the make_path() function in
 cvs2svn.py, which may change).

    root_key  ==> { 'Rel_1' : 'a', ... }
    'a'       ==> { 'myproj' : 'b', ... }
    'b'       ==> { 'trunk' : 'c', ... }
    'c'       ==> { 'lib' : 'd', ... }
    'd'       ==> { 'driver.c' : 'e', ... }
    'e'       ==> { }

 The source revision is stored in the leaf node, and also in all the
 parent nodes, in the manner described in the class documentation for
 'SymbolicNameTracker'.  The special entries "/opening" and "/closing"
 are not shown above, for brevity, but their values are where the
 revision ranges are stored (that is, the ranges indicating when this
 path could be copied from to produce the tag or branch in question).

 When it's time to create a branch or tag, cvs2svn.py walks the
 appropriate symbolic name tree, calculating the ideal source revision
 for each subpath (see 'SymbolicNameTracker' for the exact algorithm)
 and emitting the minimum number of copies to the dumpfile and to the
 skeleton repository mirror.  As it goes, it marks each path as
 emitted, so that we don't redo the same copies during the cleanup
 phase later on.

 At this point, the entire branch is done except for:

    1. Any source revisions that haven't yet been committed (this is
       a rare situation, but anyway such revisions will automatically
       be handled later by the same algorithm, invoked either due to
       another commit on the branch, or in the cleanup phase), and

    2. Files that were accidentally copied onto the branch as part of a
       subtree, but which don't actually belong on the branch, because
       the corresponding CVS file doesn't contain that tag.

 We handle (2) by doing tree diffs between the newly copied tree in the
 skeleton repository mirror, and the corresponding portion of the
 symbolic name tree.  If the skeleton mirror has a file that's not in
 the symbolic name tree, we emit a delete to the dumpfile and remove
 that path from the skeleton mirror.
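
 The tree diff can be sketched as below.  For brevity the trees are
 shown as nested in-memory dictionaries rather than as db entries, and
 the stray_paths() name is invented.  One assumption beyond the text:
 when a whole directory is missing from the symbolic name tree, the
 sketch reports the directory itself rather than each file under it.

```python
def stray_paths(mirror, symtree, prefix=""):
    """List paths present in the copied mirror tree but absent from
    the symbolic name tree; these are the candidates for deletion."""
    strays = []
    for name, child in sorted(mirror.items()):
        if name.startswith("/"):
            continue                    # metadata key, not an entry
        path = prefix + "/" + name
        if name not in symtree:
            strays.append(path)         # copied by accident
        elif child:
            # Present in both trees: compare the children.
            strays.extend(stray_paths(child, symtree[name], path))
    return strays
```

 Each path returned would get a delete in the dumpfile and be removed
 from the skeleton mirror.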

 The cleanup phase happens after all regular changes have been
 processed.  Just loop over the "root directory" of the symbolic name
 tree, running the same creation algorithm on each name (we'll have to
 distinguish between branches and tags, probably through a special
 entry on the directory object), skipping parts of the tree already
 marked as copied.


 Pass 5:
 =======

 Load the dumpfile into Subversion.  Voilà.


 -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-

 Some older notes and ideas about cvs2svn.  Not deleted, because they
 may contain suggestions for future improvements in design.

 -----------------------------------------------------------------------

 An email from John Gardiner Myers <jgmyers@speakeasy.net> about some
 considerations for the tool.

 ------
 From: John Gardiner Myers <jgmyers@speakeasy.net>
 Subject: Thoughts on CVS to SVN conversion
 To: gstein@lyra.org
 Date: Sun, 15 Apr 2001 17:47:10 -0700

 Some things you may want to consider for a CVS to SVN conversion utility:

 If converting a CVS repository to SVN takes days, it would be good for
 the conversion utility to keep its progress state on disk.  If the
 conversion fails halfway through due to a network outage or power
 failure, that would allow the conversion to be resumed where it left off
 instead of having to start over from an empty SVN repository.

 It is a short step from there to allowing periodic updates of a
 read-only SVN repository from a read/write CVS repository.  This allows
 the more relaxed conversion procedure:

 1) Create SVN repository writable only by the conversion tool.
 2) Update SVN repository from CVS repository.
 3) Announce the time of CVS to SVN cutover.
 4) Repeat step (2) as needed.
 5) Disable commits to CVS repository, making it read-only.
 6) Repeat step (2).
 7) Enable commits to SVN repository.
 8) Wait for developers to move their workspaces to SVN.
 9) Decommission the CVS repository.

 You may forward this message or parts of it as you see fit.
 ------

 -----------------------------------------------------------------------

 Further design thoughts from Greg Stein <gstein@lyra.org>

 * timestamp the beginning of the process. ignore any commits that
   occur after that timestamp; otherwise, you could miss portions of a
   commit (e.g. scan A; commit occurs to A and B; scan B; create SVN
   revision for items in B; we missed A)

 * the above timestamp can also be used for John's "grab any updates
   that were missed in the previous pass."

 * for each file processed, watch out for simultaneous commits. this
   may cause a problem during the reading/scanning/parsing of the file,
   or the parse succeeds but the results are garbaged. this could be
   fixed with a CVS lock, but I'd prefer read-only access.

   algorithm: get the mtime before opening the file. if an error occurs
   during reading, and the mtime has changed, then restart the file. if
   the read is successful, but the mtime changed, then restart the
   file.

 * use a separate log to track unique branches and non-branched forks
   of revision history (Q: is it possible to create, say, 1.4.1.3
   without a "real" branch?). this log can then be used to create a
   /branches/ directory in the SVN repository.

   Note: we want to determine some way to coalesce branches across
   files. It can't be based on name, though, since the same branch name
   could be used in multiple places, yet they are semantically
   different branches. Given files R, S, and T with branch B, we can
   tie those files' branch B into a "semantic group" whenever we see
   commit groups on a branch touching multiple files. Files that
   have a (named) branch but no commits on it are simply ignored. For
   each "semantic group" of a branch, we'd create a branch based on
   their common ancestor, then make the changes on the children as
   necessary. For single-file commits to a branch, we could use
   heuristics (pathname analysis) to add these to a group (and log what
   we did), or we could put them in a "reject" kind of file for a human
   to tell us what to do (the human would edit a config file of some
   kind to instruct the converter).

 * if we have access to the CVSROOT/history, then we could process tags
   properly. otherwise, we can only use heuristics or configuration
   info to group up tags (branches can use commits; there are no
   commits associated with tags)

 * ideally, we store every bit of data from the ,v files to enable a
   complete restoration of the CVS repository. this could be done by
   storing properties with CVS revision numbers and stuff (i.e. all
   metadata not already embodied by SVN would go into properties)

 * how do we track the "states"? I presume "dead" is simply deleting
   the entry from SVN. what are the other legal states, and do we need
   to do anything with them?

 * where do we put the "description"? how about locks, access list,
   keyword flags, etc.

 * note that using something like the SourceForge repository will be an
   ideal test case. people *move* their repositories there, which means
   that all kinds of stuff can be found in those repositories, from
   wherever people used to run them, and under whatever development
   policies may have been used.

   For example: I found one of the projects with a "permissions 644;"
   line in the "gnuplot" repository. Most RCS releases issue warnings
   about that (although they properly handle/skip the lines).
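
 The mtime-based restart algorithm from the simultaneous-commit note
 above could be sketched as follows.  The read_stable() name and the
 retry limit are assumptions, and the sketch only covers reading the
 raw bytes; a real converter would also wrap the parse step.

```python
import os


def read_stable(path, max_attempts=5):
    """Read a file, restarting if it was modified while being read."""
    for _ in range(max_attempts):
        mtime_before = os.stat(path).st_mtime_ns
        try:
            with open(path, "rb") as f:
                data = f.read()
        except OSError:
            # Error during reading: restart only if the file changed.
            if os.stat(path).st_mtime_ns != mtime_before:
                continue
            raise
        # Successful read: still restart if the mtime moved meanwhile.
        if os.stat(path).st_mtime_ns == mtime_before:
            return data
    raise RuntimeError("file kept changing while being read: %s" % path)
```

 This needs only read access to the repository, which is the stated
 preference over taking a CVS lock.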
	How cvs2svn.py Works
	=====================

	A cvs2svn run consists of 5 passes. Every pass but the last saves its
	data to a file on disk, so that a) we don't hold huge amounts of state
	in memory, and b) the conversion process is resumable. The final pass
	makes the actual Subversion commits.

	Pass 1:
	=======

	The goal of this pass is to get a summary of all the revisions for
	each file written out to 'cvs2svn-data.revs'; at the end of this
	stage, revisions will be grouped by RCS file, not by logical commits.

	We walk over the repository, processing each RCS file with with
	rcsparse.parse(), using cvs2svn's CollectData class, which is a
	subclass of rcsparse.Sink(), the parser's callback class. For each
	RCS file, the first thing the parser encounters is the administrative
	header, including the head revision, the principal branch, symbolic
	names, RCS comments, etc. The main thing that happens here is that
	CollectData.define_tag() is invoked on each symbolic name and its
	attached revision, so all the tags and branches of this file get
	collected.

	Next, the parser hits the revision summary section. That's the part
	of the RCS file that looks like this:

	1.6
	date 2002.06.12.04.54.12; author captnmark; state Exp;
	branches
	1.6.2.1;
	next 1.5;

	1.5
	date 2002.05.28.18.02.11; author captnmark; state Exp;
	branches;
	next 1.4;

	[...]

	For each revision summary, CollectData.define_revision() is invoked,
	recording that revision's metadata in the self.rev_data[] tree.

	After finishing the revision summaries, the parser invokes
	CollectData.tree_completed(), which loops over the revisions in
	self.rev_data, determining if there are instances where a higher
	revision was committed "before" a lower one (rare, but it can happen
	when there was clock skew on the repository machine). If there are
	any, it "resyncs" the timestamp of the higher rev to be just after
	that of the lower rev, but saves the original timestamp in
	self.rev_data[blah][3], so we can later write out a record to the
	resync file indicating that an adjustment was made (this makes it
	possible to catch the other parts of this commit and resync them
	similarly, more details below).

	Next, the parser encounters the real revision data, which has the
	log messages and file contents. For each revision, it invokes
	CollectData.set_revision_info(), which writes a new line to
	cvs2svn-data.revs, like this:

	3dc32955 5afe9b4ba41843d8eb52ae7db47a43eaa9573254 C 1.2 * 0 0 foo/bar,v

	The fields are:

	1. a fixed-width timestamp
	2. a digest of the log message + author
	3. the type of change ("C"hange, or "D"elete)
	4. the revision number
	5. the branch on which this commit happened, or "*" if not on a branch
	6. the number of tags rooted at this revision (followed by their
	names, space-delimited)
	7. the number of branches rooted at this revision (followed by
	their names, space-delimited)
	8. the path of the RCS file in the repository

	(Of course, in the above example, fields 6 and 7 are "0", so they have
	no additional data.)

	Also, for resync'd revisions, a line like this is written out to
	'cvs2svn-data.resync':

	3d6c1329 18a215a05abea1c6c155dcc7283b88ae7ce23502 3d6c1328

	The fields are:

	NEW_TIMESTAMP DIGEST OLD_TIMESTAMP

	(The resync file will be explained later.)

	That's it -- the RCS file is done.

	When every RCS file is done, Pass 1 is complete, and:

	- cvs2svn-data.revs contains a summary of every RCS file's
	revisions. All the revisions for a given RCS file are grouped
	together, but note that the groups are in no particular order.
	In other words, you can't yet identify the commits from looking
	at these lines; a multi-file commit will be scattered all over
	the place.

	- cvs2svn-data.resync contains a small amount of resync data, in
	no particular order.

	Pass 2:
	=======

	This is where the resync file is used. The goal of this pass is to
	convert cvs2svn-data.revs to a new file, 'cvs2svn-data.c-revs' (clean
	revs). It's the same as the original file, except for some resync'd
	timestamps.

	First, read the whole resync file into a hash table that maps each
	author+log digest to a list of lists. Each sublist represents one of
	the timestamp adjustments from Pass 1, and looks like this:

	[old_time_lower, old_time_upper, new_time]

	The reason to map each digest to a list of sublists, instead of to one
	list, is that sometimes you'll get the same digest for unrelated
	commits (for example, the same author commits many times using the
	empty log message, or a log message that just says "Doc tweaks."). So
	each digest may need to "fan out" to cover multiple commits, but
	without accidentally unifying those commits.

	Now we loop over cvs2svn-data.revs, writing each line out to
	'cvs2svn-data.c-revs'. Most lines are written out unchanged, but
	those whose digest matches some resync entry, and appear to be part of
	the same commit as one of the sublists in that entry, get tweaked.
	The tweak is to adjust the commit time of the line to the new_time,
	which is taken from the resync hash and results from the adjustment
	described in Pass 1.

	The way we figure out whether a given line needs to be tweaked is to
	loop over all the sublists, seeing if this commit's original time
	falls within the old<-->new time range for the current sublist. If it
	does, we tweak the line before writing it out, and then conditionally
	adjust the sublist's range to account for the timestamp we just
	adjusted (since it could be an outlier). Note that this could, in
	theory, result in separate commits being accidentally unified, since
	we might gradually the two sides of the range such that they are
	eventually more than COMMIT_THRESHOLD seconds apart. However, this is
	really a case of CVS not recording enough information to disambiguate
	the commits; we'd know we have a time range that exceeds the
	COMMIT_THRESHOLD, but we wouldn't necessarily know where to divide it
	up. We could try some clever heuristic, but for now it's not
	important -- after all, we're talking about commits that weren't
	important enough to have a distinctive log message anyway, so does it
	really matter if a couple of them accidentally get unified? Probably
	not.

	Pass 3:
	=======

	This is where we deduce the changesets, that is, the grouping of file
	changes into single commits.

	It's very simple -- run 'sort' on cvs2svn-data.c-revs, converting it
	to 'cvs2svn-data.s-revs'. Because of the way the data is laid out,
	this causes commits with the same digest (that is, the same author and
	log message) to be grouped together. Poof! We now have the CVS
	changes grouped by logical commit.

	In some cases, the changes in a given commit may be interleaved with
	other commits that went on at the same time, because the sort gives
	precedence to date before log digest. However, Pass 4 detects this by
	seeing that the log digest is different, and reseparates the commits.

	Pass 4:
	=======

	Walk through cvs2svn-data.s-revs and print the commits to a Subversion
	dumpfile (a file intended for 'svnadmin load'). The dumpfile is the
	data's last static stage: last chance to check over the data, run it
	through svndumpfilter, move the dumpfile to another machine, etc.

	===============================
	Branches and Tags Plan.
	===============================

	This pass is also where tag and branch creation is done. Since
	subversion does tags and branches by copying from existing revisions
	(then maybe editing the copy, making subcopies underneath, etc), the
	big question for cvs2svn is how to achieve the minimum number of
	operations per creation. For example, if it's possible to get the
	right tag by just copying revision 53, then it's better to do that
	than, say, copying revision 51 and then sub-copying in bits of
	revision 52 and 53.

	Also, since CVS does not version symbolic names, there is the
	secondary question of when to create a particular tag or branch.
	For example, a tag might have been made at any time after the youngest
	commit included in it, or might even have been made piecemeal; and the
	same is true for a branch, with the added constraint that for any
	particular file, the branch must have been created before the first
	commit on the branch.

	Answering the second question first: cvs2svn creates tags and branches
	as late as possible. For branches, this is "just in time" creation --
	the moment it sees the first commit on a branch, it snaps the entire
	branch into existence (or as much of it as possible), and then outputs
	the branch commit.

	The reason we say "as much of it as possible" is that it's possible to
	have a branch where some files have branch commits occuring earlier
	than the other files even have the source revisions from which the
	branch sprouts (this can happen if the branch was created piecemeal,
	for example). In this case, we create as much of the branch as we
	can, that is, as much of it as there are source revisions available to
	copy, and leave the rest for later. "Later" might mean just until
	other branch commits come in, or else during a cleanup stage that
	happens at the end of this pass (about which more later).

	All tags are created during the cleanup stage, after all regular
	commits have been made. That way there's no need to worry whether all
	the required revisions for a particular tag have been committed yet,
	and it's as correct as any other time, since no one can tell when a
	tag was made anyway.

	How just-in-time branch creation works:

	In order to make the "best" set of copies/deletes when creating a
	branch, cvs2svn keeps track of two sets of trees while it's making
	commits:

	1. A skeleton mirror of the subversion repository, that is, an
	array of revisions, with a tree hanging off each revision. (The
	"array" is actually implemented as an anydbm database itself,
	mapping string representations of numbers to root keys.)

	2. A tree for each CVS symbolic name, and the svn file/directory
	revisions from which various parts of that tree could be copied.

	Both tree sets live in anydbm databases, using the same basic schema:
	unique keys map to marshal.dumps() representations of dictionaries,
	which in turn map entry names to other unique keys:

	root_key ==> { entryname1 : entrykey1, entryname2 : entrykey2, ... }
	entrykey1 ==> { entrynameX : entrykeyX, ... }
	entrykey2 ==> { entrynameY : entrykeyY, ... }
	entrykeyX ==> { etc, etc ...}
	entrykeyY ==> { etc, etc ...}

	(The leaf nodes -- files -- are also dictionaries, for simplicity.)

	Both file and directory dictionaries store metadata under special keys
	whose names start with "/", so they can always be distinguished from
	entries (for example, search for "/mutable", "/openings", or
	"/closings" in cvs2svn.py).

	The repository mirror allows cvs2svn to remember what paths exist in
	what revisions. For each file path in a revision, it records what
	tags and branches can sprout from that revision; when the file
	changes, these attributes do not propagate to the new revision, since
	the symbolic name isn't based on that revision.

	The symbolic name trees are all stored in one db file, as paths, where
	the first element in each path is the symbolic name, and the rest is
	the full Subversion path to the file in question. For example, if the
	Subversion revision 7 is the root of branch 'Rel_1', this fact would
	be recorded under the path

	'/Rel_1/myproj/trunk/lib/driver.c'

	(the exact layout is dependent on the make_path() function in
	cvs2svn.py, which may change).

	root_key ==> { 'Rel_1' : 'a', ... }
	'a' ==> { 'myproj' : 'b', ... }
	'b' ==> { 'trunk : 'c', ... }
	'c' ==> { 'lib' : 'd', ... }
	'd' ==> { 'driver.c' : 'e', ... }
	'e' ==> { }

	The source revision is stored in the leaf node, and also in all the
	parent nodes, in the manner described in the class documentation for
	'SymbolicNameTracker'. The special entries "/opening" and "/closing"
	are not shown above, for brevity, but their values are where the
	revision ranges are stored (that is, the ranges indicating when this
	path could be copied from to produce the tag or branch in question).

	When it's time to create a branch or tag, cvs2svn.py walks the
	appropriate symbolic name tree, calculating the ideal source revision
	for each subpath (see 'SymbolicNameTracker' for the exact algorithm)
	and emitting the minimum number of copies to the dumpfile and to the
	skeleton repository mirror. As it goes, it marks each path as
	emitted, so that we don't redo the same copies during the cleanup
	phase later on.

	At this point, the entire branch is done except for:

	1. Any source revisions that haven't yet been committed (this is
	a rare situation, but anyway such revisions will automatically
	be handled later by the same algorithm, invoked either due to
	another commit on the branch, or in the cleanup phase), and

	2. Files that were accidentally copied onto the branch as part of a
	subtree, but which don't actually belong on the branch, because
	the corresponding CVS file doesn't contain that tag.

	We handle (2) by doing tree diffs between the newly copied tree in the
	skeleton repository mirror, and the corresponding portion of the
	symbolic name tree. If the skeleton mirror has a file that's not in
	the symbolic name tree, we emit a delete to the dumpfile and remove
	that path from the skeleton mirror.
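
	The tree diff for (2) might look roughly like the sketch below. It
	assumes both trees are plain nested dicts (directories map names to
	dicts, files map to None), which is a simplification and not the
	actual mirror format. The function names are hypothetical, and
	treating a missing directory as a single delete is an assumption of
	this sketch:

```python
def stray_paths(mirror_tree, symbol_tree, prefix=""):
    """Yield paths that exist in mirror_tree but not in symbol_tree."""
    for name in sorted(mirror_tree):
        path = prefix + "/" + name
        if name not in symbol_tree:
            # Not part of the symbolic name: one delete covers the entry
            # (and, for a directory, everything beneath it).
            yield path
        elif isinstance(mirror_tree[name], dict) and isinstance(
            symbol_tree[name], dict
        ):
            # Directory present in both trees: compare the children.
            yield from stray_paths(mirror_tree[name], symbol_tree[name], path)


def prune(mirror_tree, symbol_tree):
    """Remove stray paths from mirror_tree and return them.

    The returned list is what would be emitted as deletes to the dumpfile.
    """
    deleted = list(stray_paths(mirror_tree, symbol_tree))
    for path in deleted:
        parts = path.strip("/").split("/")
        node = mirror_tree
        for part in parts[:-1]:
            node = node[part]
        del node[parts[-1]]
    return deleted
```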

	The cleanup phase happens after all regular changes have been
	processed. Just loop over the "root directory" of the symbolic name
	tree, running the same creation algorithm on each name (we'll have to
	distinguish between branches and tags, probably through a special
	entry on the directory object), skipping parts of the tree already
	marked as copied.


	Pass 5:
	=======

	Load the dumpfile into Subversion. Voilà.




	-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

	Some older notes and ideas about cvs2svn. Not deleted, because they
	may contain suggestions for future improvements in design.

	-----------------------------------------------------------------------

	An email from John Gardiner Myers <jgmyers@speakeasy.net> about some
	considerations for the tool.

	------
	From: John Gardiner Myers <jgmyers@speakeasy.net>
	Subject: Thoughts on CVS to SVN conversion
	To: gstein@lyra.org
	Date: Sun, 15 Apr 2001 17:47:10 -0700

	Some things you may want to consider for a CVS to SVN conversion utility:

	If converting a CVS repository to SVN takes days, it would be good for
	the conversion utility to keep its progress state on disk. If the
	conversion fails halfway through due to a network outage or power
	failure, that would allow the conversion to be resumed where it left off
	instead of having to start over from an empty SVN repository.

	It is a short step from there to allowing periodic updates of a
	read-only SVN repository from a read/write CVS repository. This allows
	the more relaxed conversion procedure:

	1) Create SVN repository writable only by the conversion tool.
	2) Update SVN repository from CVS repository.
	3) Announce the time of CVS to SVN cutover.
	4) Repeat step (2) as needed.
	5) Disable commits to CVS repository, making it read-only.
	6) Repeat step (2).
	7) Enable commits to SVN repository.
	8) Wait for developers to move their workspaces to SVN.
	9) Decommission the CVS repository.

	You may forward this message or parts of it as you see fit.
	------

	-----------------------------------------------------------------------

	Further design thoughts from Greg Stein <gstein@lyra.org>

	* timestamp the beginning of the process. ignore any commits that
	occur after that timestamp; otherwise, you could miss portions of a
	commit (e.g. scan A; commit occurs to A and B; scan B; create SVN
	revision for items in B; we missed A)

	* the above timestamp can also be used for John's "grab any updates
	that were missed in the previous pass."

	* for each file processed, watch out for simultaneous commits. this
	may cause a problem during the reading/scanning/parsing of the file,
	or the parse succeeds but the results are garbage. this could be
	fixed with a CVS lock, but I'd prefer read-only access.

	algorithm: get the mtime before opening the file. if an error occurs
	during reading, and the mtime has changed, then restart the file. if
	the read is successful, but the mtime changed, then restart the
	file.
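
	a rough sketch of that mtime check, for a current Python 3. the
	function name, the parse callback, and the max_attempts limit are
	assumptions made for illustration; the note above does not say how
	many restarts to allow:

```python
import os


def read_stable(path, parse, max_attempts=5):
    """Parse a file, restarting whenever its mtime changes underneath us.

    `parse` is any callable that takes the file's bytes and returns a result.
    """
    for _ in range(max_attempts):
        mtime_before = os.stat(path).st_mtime_ns
        try:
            with open(path, "rb") as f:
                result = parse(f.read())
        except Exception:
            if os.stat(path).st_mtime_ns != mtime_before:
                continue  # error while the file was changing: restart
            raise  # file did not change, so the error is genuine
        if os.stat(path).st_mtime_ns == mtime_before:
            return result
        # Read succeeded but the file changed meanwhile: restart.
    raise RuntimeError("%s kept changing after %d attempts" % (path, max_attempts))
```

	this needs only read access to the file, which matches the stated
	preference for avoiding a CVS lock.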

	* use a separate log to track unique branches and non-branched forks
	of revision history (Q: is it possible to create, say, 1.4.1.3
	without a "real" branch?). this log can then be used to create a
	/branches/ directory in the SVN repository.

	Note: we want to determine some way to coalesce branches across
	files. It can't be based on name, though, since the same branch name
	could be used in multiple places, yet they are semantically
	different branches. Given files R, S, and T with branch B, we can
	tie those files' branch B into a "semantic group" whenever we see
	commit groups on a branch touching multiple files. Files that
	have a (named) branch but no commits on it are simply ignored. For
	each "semantic group" of a branch, we'd create a branch based on
	their common ancestor, then make the changes on the children as
	necessary. For single-file commits to a branch, we could use
	heuristics (pathname analysis) to add these to a group (and log what
	we did), or we could put them in a "reject" kind of file for a human
	to tell us what to do (the human would edit a config file of some
	kind to instruct the converter).
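
	One way to sketch the "semantic group" idea is with a union-find
	over the files touched by each commit group on a given branch name.
	This is only an illustration under assumed inputs (each commit group
	given as a set of file paths); the function name is hypothetical,
	and files seen only in single-file commits are merely collected for
	later heuristics or human review, not resolved:

```python
def group_branch_files(commit_groups):
    """Group files whose commits on one branch name touch files together.

    commit_groups: iterable of sets of file paths, one set per commit group.
    Returns (groups, unresolved): groups is a sorted list of sorted file
    lists; unresolved lists files seen only in single-file commits.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    single_only = set()
    for files in commit_groups:
        files = sorted(files)
        if len(files) == 1:
            single_only.add(files[0])
            continue
        first = find(files[0])
        for other in files[1:]:
            root = find(other)
            if root != first:
                parent[root] = first  # tie the two sets together

    groups = {}
    for f in list(parent):
        groups.setdefault(find(f), set()).add(f)
    grouped = sorted(sorted(g) for g in groups.values())
    unresolved = sorted(f for f in single_only if f not in parent)
    return grouped, unresolved
```

	With commit groups {R, S} and {S, T} on branch B, files R, S, and T
	land in one group even though no single commit touches all three.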

	* if we have access to the CVSROOT/history, then we could process tags
	properly. otherwise, we can only use heuristics or configuration
	info to group up tags (branches can use commits; there are no
	commits associated with tags)

	* ideally, we store every bit of data from the ,v files to enable a
	complete restoration of the CVS repository. this could be done by
	storing properties with CVS revision numbers and stuff (i.e. all
	metadata not already embodied by SVN would go into properties)

	* how do we track the "states"? I presume "dead" is simply deleting
	the entry from SVN. what are the other legal states, and do we need
	to do anything with them?

	* where do we put the "description"? how about locks, access list,
	keyword flags, etc.

	* note that using something like the SourceForge repository will be an
	ideal test case. people move their repositories there, which means
	that all kinds of stuff can be found in those repositories, from
	wherever people used to run them, and under whatever development
	policies may have been used.

	For example: I found one of the projects with a "permissions 644;"
	line in the "gnuplot" repository. Most RCS releases issue warnings
	about that (although they properly handle/skip the lines).