tools/cvs2svn/design-notes.txt - subversion - Git at Google

 I am starting this file on 9/22/2000, before a single token of code
 for cvs2svn has been written.  In the future, I may change parts of
 it from future tense to present or past.  -- K<bob>

 			     ===========
 			     ASSUMPTIONS
 			     ===========

 I will state some assumptions up front that don't really seem worthy
 of elaborate justification right now.

 cvs2svn will be a Unix command line tool.  No GUI.  No HTTP interface.
 No bonoboification.  Not an emacs macro.

 cvs2svn's only purpose in life is to convert a CVS repository to an
 subversion repository.

 cvs2svn will only be able to do a big-bang, whole repository
 conversion.  It will not be able to sync or merge an existing CVS
 repository with an existing SVN repository.  After all, solving that
 problem is just as hard as writing SVN. (-: *MAYBE* it'll be desirable
 to subset the CVS repository based on dates or directories or vendor
 branches or something, but that will not be a future development.

 				=====
 				PLANS
 				=====

 In this section, I ask some design questions, discuss pros and cons,
 and present my conclusion.

 ------------------------------------------------------------------------

 Question:

 	Should cvs2svn read the CVS repository directly or should it
 	use the cvs client to access it?

 Thoughts:

 	If we read the repository directly, we couple ourselves
 	tightly to past, present, and future versions of CVS.  That is
 	bad.

 	There may be ambiguity in `cvs log' output, which wouldn't be
 	there in the RCS files.

 	Using cvs allows conversion of remote repositories with no
 	extra effort.

 Conclusion:

 	All CVS repository operations will be done through the cvs
 	command, if possible.

 ------------------------------------------------------------------------

 Question:

 	In what language(s) should cvs2svn be written?

 Thoughts:

 	Ref. Alternate Hard and Soft Layers.
 	http://c2.com/cgi/wiki/wiki?AlternateHardAndSoftLayers

 	The rest of the subversion project is written in C.

 	Karl has written a Perl script called cvs2cl.pl which
 	parses `cvs log' output and pattern-matches similar
 	file-granularity log messages into single commits.  I'd
 	like to reuse as much of that code as possible.
 	Rewriting that in C would be tedious.

 	Other soft languages, e.g., Python, sh, scheme, Visual
 	Basic ( :-) ), exist, but (a) I'm fluent in Perl, (b)
 	Perl has a huge installed base (but so does sh), (c)
 	Karl's script is in Perl.

 	Is it okay to make perl a prerequisite?  The svn client
 	doesn't require Perl.

 	To write the subversion repository, we should call libsvn_fs
 	through the svn_delta_edit_fns_t switch.  That requires C.

 Conclusion:

 	main in Perl.  cvs log parsing and commit aggregating in Perl.
 	Subversion access in C through the svn_delta_edit_fns_t switch.

 	OR...  write the whole thing in Perl, and call both the cvs
 	client and the svn client through fork/exec or popen.

 ------------------------------------------------------------------------

 Question:

 	Assuming there's a Perl component and a C component,
 	how do they communicate?

 Thoughts:

 	We could use XS or swig to call C from Perl.  At some point, a
 	Perl zealot may want to add that so he can write a perl
 	subversion client.  I am not that Perl zealot.

 	Pipes are nice.  I don't know yet whether the C part needs
 	bidirectional communication with the perl driver.  If it does,
 	we could let the driver take over both stdin and stdout of the
 	C part.

 	I have written code that uses popen (unidirectionally) on
 	Windows NT, so pipes are not completely nonportable to Satan's
 	OS.

 Conclusion:

 	The Perl part forks and execs the C part, and talks to it
 	through a pipe.  The C part may talk back through another
 	pipe.

 	TBD whether the C part is exec'd once or once per commit or ...

 ------------------------------------------------------------------------

 Question:

 	Assuming there's a Perl component, which version of Perl is
 	required?

 Thoughts:

 	I'll test it with one of the 5.003 series perls, as well as
 	the current 5.6.x version.

 	I'll refrain from using anything from CPAN that hasn't
 	been in the core Perl distribution (and stable) for quite
 	a while.

 	I don't eat Perl for breakfast, so I may make mistakes here...
 	I do know that some of the machines I use don't support \A and
 	\z in regexps yet.

 Conclusion:

 	Use as old a Perl as possible.  Test with back revs.

 	Don't use non-core packages (or, if I do, I'll include them
 	inline.)

 ------------------------------------------------------------------------

 Question: What are the command line arguments?

 Thoughts:

 	For now, I want a minimal set.  Later, we can add bells and
 	whistles and one of those little stripey things that bounces
 	up and down and changes colors while it rotates.

 Conclusion:

 	Use: cvs2svn [ -d cvs-repository-spec ] new-subversion-root-dir

 	Use $CVSROOT if it's there (and `-d ...' isn't).

 	Require new-subversion-root-dir to be nonexistent (but its parent
 	must exist).

 				======
 				ISSUES
 				======

 In this section, I ask some design questions and discuss pros and
 cons.  I don't present a solution, because I don't see a best solution
 yet.

 ------------------------------------------------------------------------

 Question:

 	What kind(s) and format(s) of user documentation do we want?
 	Where is it in the tree?

 Thoughts:

 	What's the rest of the project going to use?  (Am I the first
 	to ask this question?)

 	A good man page (whether troff or another format) should
 	suffice -- I don't see us needing separate user manual.

 ------------------------------------------------------------------------

 Question:

 	What errors will cvs report, and how will it report them?

 Thoughts:

 	Standard error is my friend.

 	The nature of the errors will become obvious as the code
 	is written.

 ------------------------------------------------------------------------

 Question:

 	Should cvs2svn write the Subversion repository directly or
 	should it use the svn client to access it?

 Thoughts:

 	It seems like it would be easy to use cvs and svn.
 	The whole program would be (in pseudocode):

 		commits = read_cvs_logs_and_figure_out_what_to_do();
 		for commits {
 			cvs checkout ...
 			svn commit ...
 		}

 	The svn client isn't written yet.  That's a problem.

 	The svn client looks in the filesystem at the working copy.
 	If cvs2svn calls libsvn_fs directly, we don't need a working
 	copy, and we can skip the filesystem I/O.  (How much does this
 	matter?)

 	How would we override date stamps and user names (and what
 	else?) on svn commits?

 	I wonder how far Karl and Ben (or anybody else) has thought
 	through what the svn client's command line is.

 ------------------------------------------------------------------------

 Question:

 	How hard is it to determine the shape of the repository
 	from the `cvs log' output?

 Thoughts:

 	If it isn't obvious yet, I am not a CVS power user.  I'm
 	trying to build a toy repository right now so I can play with
 	it and get a better feeling.

 	Since CVS doesn't support renaming files directly, I expect to
 	need some ad-hocu-ery to recognize renames.

 	What happens if the log message contains:
 		my text here
 		----------------------------
 		revision 1.2
 		date: 2000/01/10 00:00:00;  author: fred;  state: Exp;  lines: +2 -9
 		my other text here
 	In other words, a log message contains what looks like a
 	log message header.

 ------------------------------------------------------------------------

 Question:

 	How do we handle the Attic?

 Question:

 	How does subversion represent branches and branch names?

 Question:

 	How does subversion represent symbolic tags?

 Question:

 	What do we do with the files in CVSROOT?

 ------------------------------------------------------------------------

 An email from John Gardiner Myers <jgmyers@speakeasy.net> about some
 considerations for the tool.

 ------
 From: John Gardiner Myers <jgmyers@speakeasy.net>
 Subject: Thoughts on CVS to SVN conversion
 To: gstein@lyra.org
 Date: Sun, 15 Apr 2001 17:47:10 -0700

 Some things you may want to consider for a CVS to SVN conversion utility:

 If converting a CVS repository to SVN takes days, it would be good for
 the conversion utility to keep its progress state on disk.  If the
 conversion fails halfway through due to a network outage or power
 failure, that would allow the conversion to be resumed where it left off
 instead of having to start over from an empty SVN repository.

 It is a short step from there to allowing periodic updates of a
 read-only SVN repository from a read/write CVS repository.  This allows
 the more relaxed conversion procedure:

 1) Create SVN repository writable only by the conversion tool.
 2) Update SVN repository from CVS repository.
 3) Announce the time of CVS to SVN cutover.
 4) Repeat step (2) as needed.
 5) Disable commits to CVS repository, making it read-only.
 6) Repeat step (2).
 7) Enable commits to SVN repository.
 8) Wait for developers to move their workspaces to SVN.
 9) Decomission the CVS repository.

 You may forward this message or parts of it as you seem fit.
 ------

 ------------------------------------------------------------------------

 Further design thoughts from Greg Stein <gstein@lyra.org>

 * timestamp the beginning of the process. ignore any commits that
   occur after that timestamp; otherwise, you could miss portions of a
   commit (e.g. scan A; commit occurs to A and B; scan B; create SVN
   revision for items in B; we missed A)

 * the above timestamp can also be used for John's "grab any updates
   that were missed in the previous pass."

 * for each file processed, watch out for simultaneous commits. this
   may cause a problem during the reading/scanning/parsing of the file,
   or the parse succeeds but the results are garbaged. this could be
   fixed with a CVS lock, but I'd prefer read-only access.

   algorithm: get the mtime before opening the file. if an error occurs
   during reading, and the mtime has changed, then restart the file. if
   the read is successful, but the mtime changed, then restart the
   file.

 * dump file metadata to a separate log file(s). in particular, we want
   the following items for each commit:
   - MD5 hash of the commit message
   - author
   - timestamp

   The above three items are used to coalesce the commit. Remember to
   use a fudge factor for the timestamp. (the fudge cannot be fixed
   because a commit could occur over an arbitrary length of time, based
   on size of commit and the network connection used for the commit;
   figure out an algorithm here)

   All other metadata needs to be preserved, but that can probably
   happen when we re-read the file to generate the SVN revisions.

   We would sort the log file generated above (GNU sort can handle
   arbitrarily large files). Then scan the file progressively,
   generating the commit groups.

 * use a separate log to track unique branches and non-branched forks
   of revision history (Q: is it possible to create, say, 1.4.1.3
   without a "real" branch?). this log can then be used to create a
   /branches/ directory in the SVN repository.

   Note: we want to determine some way to coalesce branches across
   files. It can't be based on name, though, since the same branch name
   could be used in multiple places, yet they are semantically
   different branches. Given files R, S, and T with branch B, we can
   tie those files' branch B into a "semantic group" whenever we see
   commit groups on a branch touching multiple files. Files that are
   have a (named) branch but no commits on it are simply ignored. For
   each "semantic group" of a branch, we'd create a branch based on
   their common ancestor, then make the changes on the children as
   necessary. For single-file commits to a branch, we could use
   heuristics (pathname analysis) to add these to a group (and log what
   we did), or we could put them in a "reject" kind of file for a human
   to tell us what to do (the human would edit a config file of some
   kind to instruct the converter).

 * if we have access to the CVSROOT/history, then we could process tags
   properly. otherwise, we can only use heuristics or configuration
   info to group up tags (branches can use commits; there are no
   commits associated with tags)

 * ideally, we store every bit of data from the ,v files to enable a
   complete restoration of the CVS repository. this could be done by
   storing properties with CVS revision numbers and stuff (i.e. all
   metadata not already embodied by SVN would go into properties)

 * how do we track the "states"? I presume "dead" is simply deleting
   the entry from SVN. what are the other legal states, and do we need
   to do anything with them?

 * where do we put the "description"? how about locks, access list,
   keyword flags, etc.

 * note that using something like the SourceForge repository will be an
   ideal test case. people *move* their repositories there, which means
   that all kinds of stuff can be found in those repositories, from
   wherever people used to run them, and under whatever development
   policies may have been used.

   For example: I found one of the projects with a "permissions 644;"
   line in the "gnuplot" repository. Most RCS releases issue warnings
   about that (although they properly handle/skip the lines).
	I am starting this file on 9/22/2000, before a single token of code
	for cvs2svn has been written. In the future, I may change parts of
	it from future tense to present or past. -- K<bob>

	===========
	ASSUMPTIONS
	===========

	I will state some assumptions up front that don't really seem worthy
	of elaborate justification right now.

	cvs2svn will be a Unix command line tool. No GUI. No HTTP interface.
	No bonoboification. Not an emacs macro.

	cvs2svn's only purpose in life is to convert a CVS repository to an
	subversion repository.

	cvs2svn will only be able to do a big-bang, whole repository
	conversion. It will not be able to sync or merge an existing CVS
	repository with an existing SVN repository. After all, solving that
	problem is just as hard as writing SVN. (-: MAYBE it'll be desirable
	to subset the CVS repository based on dates or directories or vendor
	branches or something, but that will not be a future development.

	=====
	PLANS
	=====

	In this section, I ask some design questions, discuss pros and cons,
	and present my conclusion.

	------------------------------------------------------------------------

	Question:

	Should cvs2svn read the CVS repository directly or should it
	use the cvs client to access it?

	Thoughts:

	If we read the repository directly, we couple ourselves
	tightly to past, present, and future versions of CVS. That is
	bad.

	There may be ambiguity in `cvs log' output, which wouldn't be
	there in the RCS files.

	Using cvs allows conversion of remote repositories with no
	extra effort.

	Conclusion:

	All CVS repository operations will be done through the cvs
	command, if possible.

	------------------------------------------------------------------------

	Question:

	In what language(s) should cvs2svn be written?

	Thoughts:

	Ref. Alternate Hard and Soft Layers.
	http://c2.com/cgi/wiki/wiki?AlternateHardAndSoftLayers

	The rest of the subversion project is written in C.

	Karl has written a Perl script called cvs2cl.pl which
	parses `cvs log' output and pattern-matches similar
	file-granularity log messages into single commits. I'd
	like to reuse as much of that code as possible.
	Rewriting that in C would be tedious.

	Other soft languages, e.g., Python, sh, scheme, Visual
	Basic ( :-) ), exist, but (a) I'm fluent in Perl, (b)
	Perl has a huge installed base (but so does sh), (c)
	Karl's script is in Perl.

	Is it okay to make perl a prerequisite? The svn client
	doesn't require Perl.

	To write the subversion repository, we should call libsvn_fs
	through the svn_delta_edit_fns_t switch. That requires C.

	Conclusion:

	main in Perl. cvs log parsing and commit aggregating in Perl.
	Subversion access in C through the svn_delta_edit_fns_t switch.

	OR... write the whole thing in Perl, and call both the cvs
	client and the svn client through fork/exec or popen.

	------------------------------------------------------------------------

	Question:

	Assuming there's a Perl component and a C component,
	how do they communicate?

	Thoughts:

	We could use XS or swig to call C from Perl. At some point, a
	Perl zealot may want to add that so he can write a perl
	subversion client. I am not that Perl zealot.

	Pipes are nice. I don't know yet whether the C part needs
	bidirectional communication with the perl driver. If it does,
	we could let the driver take over both stdin and stdout of the
	C part.

	I have written code that uses popen (unidirectionally) on
	Windows NT, so pipes are not completely nonportable to Satan's
	OS.

	Conclusion:

	The Perl part forks and execs the C part, and talks to it
	through a pipe. The C part may talk back through another
	pipe.

	TBD whether the C part is exec'd once or once per commit or ...

	------------------------------------------------------------------------

	Question:

	Assuming there's a Perl component, which version of Perl is
	required?

	Thoughts:

	I'll test it with one of the 5.003 series perls, as well as
	the current 5.6.x version.

	I'll refrain from using anything from CPAN that hasn't
	been in the core Perl distribution (and stable) for quite
	a while.

	I don't eat Perl for breakfast, so I may make mistakes here...
	I do know that some of the machines I use don't support \A and
	\z in regexps yet.

	Conclusion:

	Use as old a Perl as possible. Test with back revs.

	Don't use non-core packages (or, if I do, I'll include them
	inline.)

	------------------------------------------------------------------------

	Question: What are the command line arguments?

	Thoughts:

	For now, I want a minimal set. Later, we can add bells and
	whistles and one of those little stripey things that bounces
	up and down and changes colors while it rotates.

	Conclusion:

	Use: cvs2svn [ -d cvs-repository-spec ] new-subversion-root-dir

	Use $CVSROOT if it's there (and `-d ...' isn't).

	Require new-subversion-root-dir to be nonexistent (but its parent
	must exist).

	======
	ISSUES
	======

	In this section, I ask some design questions and discuss pros and
	cons. I don't present a solution, because I don't see a best solution
	yet.

	------------------------------------------------------------------------

	Question:

	What kind(s) and format(s) of user documentation do we want?
	Where is it in the tree?

	Thoughts:

	What's the rest of the project going to use? (Am I the first
	to ask this question?)

	A good man page (whether troff or another format) should
	suffice -- I don't see us needing separate user manual.

	------------------------------------------------------------------------

	Question:

	What errors will cvs report, and how will it report them?

	Thoughts:

	Standard error is my friend.

	The nature of the errors will become obvious as the code
	is written.

	------------------------------------------------------------------------

	Question:

	Should cvs2svn write the Subversion repository directly or
	should it use the svn client to access it?

	Thoughts:

	It seems like it would be easy to use cvs and svn.
	The whole program would be (in pseudocode):

	commits = read_cvs_logs_and_figure_out_what_to_do();
	for commits {
	cvs checkout ...
	svn commit ...
	}

	The svn client isn't written yet. That's a problem.

	The svn client looks in the filesystem at the working copy.
	If cvs2svn calls libsvn_fs directly, we don't need a working
	copy, and we can skip the filesystem I/O. (How much does this
	matter?)

	How would we override date stamps and user names (and what
	else?) on svn commits?

	I wonder how far Karl and Ben (or anybody else) has thought
	through what the svn client's command line is.

	------------------------------------------------------------------------

	Question:

	How hard is it to determine the shape of the repository
	from the `cvs log' output?

	Thoughts:

	If it isn't obvious yet, I am not a CVS power user. I'm
	trying to build a toy repository right now so I can play with
	it and get a better feeling.

	Since CVS doesn't support renaming files directly, I expect to
	need some ad-hocu-ery to recognize renames.

	What happens if the log message contains:
	my text here
	----------------------------
	revision 1.2
	date: 2000/01/10 00:00:00; author: fred; state: Exp; lines: +2 -9
	my other text here
	In other words, a log message contains what looks like a
	log message header.

	------------------------------------------------------------------------

	Question:

	How do we handle the Attic?

	Question:

	How does subversion represent branches and branch names?

	Question:

	How does subversion represent symbolic tags?

	Question:

	What do we do with the files in CVSROOT?

	------------------------------------------------------------------------

	An email from John Gardiner Myers <jgmyers@speakeasy.net> about some
	considerations for the tool.

	------
	From: John Gardiner Myers <jgmyers@speakeasy.net>
	Subject: Thoughts on CVS to SVN conversion
	To: gstein@lyra.org
	Date: Sun, 15 Apr 2001 17:47:10 -0700

	Some things you may want to consider for a CVS to SVN conversion utility:

	If converting a CVS repository to SVN takes days, it would be good for
	the conversion utility to keep its progress state on disk. If the
	conversion fails halfway through due to a network outage or power
	failure, that would allow the conversion to be resumed where it left off
	instead of having to start over from an empty SVN repository.

	It is a short step from there to allowing periodic updates of a
	read-only SVN repository from a read/write CVS repository. This allows
	the more relaxed conversion procedure:

	1) Create SVN repository writable only by the conversion tool.
	2) Update SVN repository from CVS repository.
	3) Announce the time of CVS to SVN cutover.
	4) Repeat step (2) as needed.
	5) Disable commits to CVS repository, making it read-only.
	6) Repeat step (2).
	7) Enable commits to SVN repository.
	8) Wait for developers to move their workspaces to SVN.
	9) Decomission the CVS repository.

	You may forward this message or parts of it as you seem fit.
	------

	------------------------------------------------------------------------

	Further design thoughts from Greg Stein <gstein@lyra.org>

	* timestamp the beginning of the process. ignore any commits that
	occur after that timestamp; otherwise, you could miss portions of a
	commit (e.g. scan A; commit occurs to A and B; scan B; create SVN
	revision for items in B; we missed A)

	* the above timestamp can also be used for John's "grab any updates
	that were missed in the previous pass."

	* for each file processed, watch out for simultaneous commits. this
	may cause a problem during the reading/scanning/parsing of the file,
	or the parse succeeds but the results are garbaged. this could be
	fixed with a CVS lock, but I'd prefer read-only access.

	algorithm: get the mtime before opening the file. if an error occurs
	during reading, and the mtime has changed, then restart the file. if
	the read is successful, but the mtime changed, then restart the
	file.

	* dump file metadata to a separate log file(s). in particular, we want
	the following items for each commit:
	- MD5 hash of the commit message
	- author
	- timestamp

	The above three items are used to coalesce the commit. Remember to
	use a fudge factor for the timestamp. (the fudge cannot be fixed
	because a commit could occur over an arbitrary length of time, based
	on size of commit and the network connection used for the commit;
	figure out an algorithm here)

	All other metadata needs to be preserved, but that can probably
	happen when we re-read the file to generate the SVN revisions.

	We would sort the log file generated above (GNU sort can handle
	arbitrarily large files). Then scan the file progressively,
	generating the commit groups.

	* use a separate log to track unique branches and non-branched forks
	of revision history (Q: is it possible to create, say, 1.4.1.3
	without a "real" branch?). this log can then be used to create a
	/branches/ directory in the SVN repository.

	Note: we want to determine some way to coalesce branches across
	files. It can't be based on name, though, since the same branch name
	could be used in multiple places, yet they are semantically
	different branches. Given files R, S, and T with branch B, we can
	tie those files' branch B into a "semantic group" whenever we see
	commit groups on a branch touching multiple files. Files that are
	have a (named) branch but no commits on it are simply ignored. For
	each "semantic group" of a branch, we'd create a branch based on
	their common ancestor, then make the changes on the children as
	necessary. For single-file commits to a branch, we could use
	heuristics (pathname analysis) to add these to a group (and log what
	we did), or we could put them in a "reject" kind of file for a human
	to tell us what to do (the human would edit a config file of some
	kind to instruct the converter).

	* if we have access to the CVSROOT/history, then we could process tags
	properly. otherwise, we can only use heuristics or configuration
	info to group up tags (branches can use commits; there are no
	commits associated with tags)

	* ideally, we store every bit of data from the ,v files to enable a
	complete restoration of the CVS repository. this could be done by
	storing properties with CVS revision numbers and stuff (i.e. all
	metadata not already embodied by SVN would go into properties)

	* how do we track the "states"? I presume "dead" is simply deleting
	the entry from SVN. what are the other legal states, and do we need
	to do anything with them?

	* where do we put the "description"? how about locks, access list,
	keyword flags, etc.

	* note that using something like the SourceForge repository will be an
	ideal test case. people move their repositories there, which means
	that all kinds of stuff can be found in those repositories, from
	wherever people used to run them, and under whatever development
	policies may have been used.

	For example: I found one of the projects with a "permissions 644;"
	line in the "gnuplot" repository. Most RCS releases issue warnings
	about that (although they properly handle/skip the lines).