| This file contains (very old) design notes for the cvs2svn repository |
| converter. We've not yet deleted them, because they may contain |
| suggestions for future improvements in design. |
| |
| ===================================================== |
| |
| |
| An email from John Gardiner Myers <jgmyers@speakeasy.net> about some |
| considerations for the tool. |
| |
| ------ |
| From: John Gardiner Myers <jgmyers@speakeasy.net> |
| Subject: Thoughts on CVS to SVN conversion |
| To: gstein@lyra.org |
| Date: Sun, 15 Apr 2001 17:47:10 -0700 |
| |
| Some things you may want to consider for a CVS to SVN conversion utility: |
| |
| If converting a CVS repository to SVN takes days, it would be good for |
| the conversion utility to keep its progress state on disk. If the |
| conversion fails halfway through due to a network outage or power |
| failure, that would allow the conversion to be resumed where it left off |
| instead of having to start over from an empty SVN repository. |
| |
| It is a short step from there to allowing periodic updates of a |
| read-only SVN repository from a read/write CVS repository. This allows |
| the more relaxed conversion procedure: |
| |
| 1) Create SVN repository writable only by the conversion tool. |
| 2) Update SVN repository from CVS repository. |
| 3) Announce the time of CVS to SVN cutover. |
| 4) Repeat step (2) as needed. |
| 5) Disable commits to CVS repository, making it read-only. |
| 6) Repeat step (2). |
| 7) Enable commits to SVN repository. |
| 8) Wait for developers to move their workspaces to SVN. |
| 9) Decommission the CVS repository. |
| |
| You may forward this message or parts of it as you see fit. |
| ------ |
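
A minimal sketch of the "keep progress state on disk" idea from John's mail, in Python. Everything named here is an assumption made for illustration and not part of any existing tool: the JSON state file, the "done" list, and the `convert_one` callback that converts a single item. The state is rewritten after every item, so an interrupted run can pick up where it left off.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved progress, or a fresh state if none exists yet."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {"done": []}


def save_state(path, state):
    """Write to a temp file and rename it, so a crash never leaves a
    half-written state file behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def convert_all(items, convert_one, state_path):
    """Convert each item once, skipping those finished in earlier runs."""
    state = load_state(state_path)
    done = set(state["done"])
    for item in items:
        if item in done:
            continue
        convert_one(item)  # may raise; earlier progress is already on disk
        done.add(item)
        state["done"] = sorted(done)
        save_state(state_path, state)
    return state
```

Running `convert_all` again with the same state file after a failure redoes only the unfinished items. The same loop could serve step (2) of the procedure above, provided the item list is rebuilt on every pass.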
| |
| ------------------------------------------------------------------------ |
| |
| Further design thoughts from Greg Stein <gstein@lyra.org> |
| |
| * timestamp the beginning of the process. ignore any commits that |
| occur after that timestamp; otherwise, you could miss portions of a |
| commit (e.g. scan A; commit occurs to A and B; scan B; create SVN |
| revision for items in B; we missed A) |
| |
| * the above timestamp can also be used for John's "grab any updates |
| that were missed in the previous pass." |
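
A small sketch of the timestamp cutoff, in Python. The revision records (dicts with a "timestamp" key, in seconds) and the function name are assumptions for illustration only. The first pass keeps everything up to the cutoff. A later pass hands in the previous cutoff and picks up only what arrived since then.

```python
def select_revisions(revisions, cutoff, previous_cutoff=None):
    """Keep revisions with previous_cutoff < timestamp <= cutoff."""
    selected = []
    for rev in revisions:
        ts = rev["timestamp"]
        if ts > cutoff:
            continue  # committed after this pass started: leave for later
        if previous_cutoff is not None and ts <= previous_cutoff:
            continue  # already handled by an earlier pass
        selected.append(rev)
    return selected
```

Note that this does nothing special for a commit that straddles the cutoff. That case would still need handling on top of this filter.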
| |
| * for each file processed, watch out for simultaneous commits. this |
| may cause a problem during the reading/scanning/parsing of the file, |
| or the parse succeeds but the results are garbage. this could be |
| fixed with a CVS lock, but I'd prefer read-only access. |
| |
| algorithm: get the mtime before opening the file. if an error occurs |
| during reading, and the mtime has changed, then restart the file. if |
| the read is successful, but the mtime changed, then restart the |
| file. |
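
The mtime check above could look roughly like this in Python. The function name, the `parse` callback, and the `max_attempts` cap are assumptions added for illustration. The notes themselves do not say how often to retry.

```python
import os


def read_stable(path, parse, max_attempts=5):
    """Parse the file, restarting while its mtime changes underneath us."""
    for _ in range(max_attempts):
        before = os.stat(path).st_mtime_ns
        try:
            with open(path, "rb") as f:
                result = parse(f.read())
        except Exception:
            if os.stat(path).st_mtime_ns != before:
                continue  # error and the file changed: restart the file
            raise  # error on an unchanged file: a genuine problem
        if os.stat(path).st_mtime_ns == before:
            return result  # read succeeded and the file did not change
        # read succeeded but the mtime changed: restart the file
    raise RuntimeError("%s kept changing after %d attempts" % (path, max_attempts))
```

This needs only read access to the repository, which matches the preference stated above for avoiding a CVS lock.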
| |
| * dump file metadata to a separate log file(s). in particular, we want |
| the following items for each commit: |
| - MD5 hash of the commit message |
| - author |
| - timestamp |
| |
| The above three items are used to coalesce the per-file revisions |
| into commits. Remember to use a fudge factor for the timestamp. (the |
| fudge cannot be a fixed value, because a commit could occur over an |
| arbitrary length of time, based on size of commit and the network |
| connection used for the commit; figure out an algorithm here) |
| |
| All other metadata needs to be preserved, but that can probably |
| happen when we re-read the file to generate the SVN revisions. |
| |
| We would sort the log file generated above (GNU sort can handle |
| arbitrarily large files). Then scan the file progressively, |
| generating the commit groups. |
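
A sketch of the sort-then-scan grouping, in Python. The record layout (dicts with "message", "author", "timestamp"), the function names, and the 300-second default are all assumptions for illustration. As one possible answer to the fudge question, the gap is measured from the previous revision in the group and not from the first one, so a slow commit can run arbitrarily long as long as no single gap exceeds the fudge. An in-memory `sorted()` stands in for the external sort.

```python
import hashlib
from itertools import groupby


def log_key(rev):
    """Sort key: MD5 of the commit message, then author, then timestamp."""
    digest = hashlib.md5(rev["message"].encode("utf-8")).hexdigest()
    return (digest, rev["author"], rev["timestamp"])


def coalesce(revisions, fudge=300):
    """Group revisions that share digest and author and whose timestamps
    chain together with gaps of at most `fudge` seconds."""
    commits = []
    ordered = sorted(revisions, key=log_key)
    for _, same in groupby(ordered, key=lambda r: log_key(r)[:2]):
        group = []
        for rev in same:
            if group and rev["timestamp"] - group[-1]["timestamp"] > fudge:
                commits.append(group)  # gap too large: start a new commit
                group = []
            group.append(rev)
        commits.append(group)
    commits.sort(key=lambda g: g[0]["timestamp"])  # chronological order
    return commits
```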
| |
| * use a separate log to track unique branches and non-branched forks |
| of revision history (Q: is it possible to create, say, 1.4.1.3 |
| without a "real" branch?). this log can then be used to create a |
| /branches/ directory in the SVN repository. |
| |
| Note: we want to determine some way to coalesce branches across |
| files. It can't be based on name, though, since the same branch name |
| could be used in multiple places, yet they are semantically |
| different branches. Given files R, S, and T with branch B, we can |
| tie those files' branch B into a "semantic group" whenever we see |
| commit groups on a branch touching multiple files. Files that have |
| a (named) branch but no commits on it are simply ignored. For |
| each "semantic group" of a branch, we'd create a branch based on |
| their common ancestor, then make the changes on the children as |
| necessary. For single-file commits to a branch, we could use |
| heuristics (pathname analysis) to add these to a group (and log what |
| we did), or we could put them in a "reject" kind of file for a human |
| to tell us what to do (the human would edit a config file of some |
| kind to instruct the converter). |
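
One way to build the "semantic groups" is a union-find over (file, branch) pairs, sketched here in Python. The input shape (each commit group given as a list of (path, branch) pairs) and the function name are assumptions for illustration. Pairs that only ever show up in single-file commits come back separately, ready for the heuristics or the "reject" file mentioned above.

```python
def semantic_groups(branch_commits):
    """Tie together (path, branch) pairs that share a commit group.

    Returns (groups, singles): sets of pairs joined by multi-file
    commits, and pairs seen only in single-file commits.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for commit in branch_commits:
        members = list(commit)
        for member in members:
            find(member)  # register every pair, even lone ones
        for other in members[1:]:
            union(members[0], other)

    by_root = {}
    for member in parent:
        by_root.setdefault(find(member), set()).add(member)
    groups = [g for g in by_root.values() if len(g) > 1]
    singles = [next(iter(g)) for g in by_root.values() if len(g) == 1]
    return groups, singles
```

Files with a named branch but no commits on it never appear in the input, so they are ignored, as described above.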
| |
| * if we have access to the CVSROOT/history, then we could process tags |
| properly. otherwise, we can only use heuristics or configuration |
| info to group up tags (branches can use commits; there are no |
| commits associated with tags) |
| |
| * ideally, we store every bit of data from the ,v files to enable a |
| complete restoration of the CVS repository. this could be done by |
| storing properties with CVS revision numbers and stuff (i.e. all |
| metadata not already embodied by SVN would go into properties) |
| |
| * how do we track the "states"? I presume "dead" is simply deleting |
| the entry from SVN. what are the other legal states, and do we need |
| to do anything with them? |
| |
| * where do we put the "description"? how about locks, access list, |
| keyword flags, etc. |
| |
| * note that using something like the SourceForge repository will be an |
| ideal test case. people *move* their repositories there, which means |
| that all kinds of stuff can be found in those repositories, from |
| wherever people used to run them, and under whatever development |
| policies may have been used. |
| |
| For example: in one of the projects (the "gnuplot" repository), I |
| found a "permissions 644;" line. Most RCS releases issue warnings |
| about that (although they properly handle/skip the lines). |