| How cvs2svn.py Works |
| ===================== |
| |
| A cvs2svn run consists of 5 passes. Every pass but the last saves its |
| data to a file on disk, so that a) we don't hold huge amounts of state |
| in memory, and b) the conversion process is resumable. The final pass |
| makes the actual Subversion commits. |
| |
| Pass 1: |
| ======= |
| |
| The goal of this pass is to get a summary of all the revisions for |
| each file written out to 'cvs2svn-data.revs'; at the end of this |
| stage, revisions will be grouped by RCS file, not by logical commits. |
| |
| We walk over the repository, processing each RCS file with |
| rcsparse.parse(), using cvs2svn's CollectData class, which is a |
| subclass of rcsparse.Sink(), the parser's callback class. For each |
| RCS file, the first thing the parser encounters is the administrative |
| header, including the head revision, the principal branch, symbolic |
| names, RCS comments, etc. The main thing that happens here is that |
| CollectData.define_tag() is invoked on each symbolic name and its |
| attached revision, so all the tags and branches of this file get |
| collected. |
| |
| Next, the parser hits the revision summary section. That's the part |
| of the RCS file that looks like this: |
| |
| 1.6 |
| date 2002.06.12.04.54.12; author captnmark; state Exp; |
| branches |
| 1.6.2.1; |
| next 1.5; |
| |
| 1.5 |
| date 2002.05.28.18.02.11; author captnmark; state Exp; |
| branches; |
| next 1.4; |
| |
| [...] |
| |
| For each revision summary, CollectData.define_revision() is invoked, |
| recording that revision's metadata in the self.rev_data[] tree. |
| |
| After finishing the revision summaries, the parser invokes |
| CollectData.tree_completed(), which loops over the revisions in |
| self.rev_data, determining if there are instances where a higher |
| revision was committed "before" a lower one (rare, but it can happen |
| when there was clock skew on the repository machine). If there are |
| any, it "resyncs" the timestamp of the higher rev to be just after |
| that of the lower rev, but saves the original timestamp in |
| self.rev_data[blah][3], so we can later write out a record to the |
| resync file indicating that an adjustment was made (this makes it |
| possible to catch the other parts of this commit and resync them |
| similarly, more details below). |
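| |
| The timestamp adjustment described above can be sketched as follows. |
| This is a minimal illustration, not the actual tree_completed() code: |
| the function name resync_timestamps() and the list-of-pairs input are |
| hypothetical, and "just after" is assumed to mean one second later. |
| |
```python
def resync_timestamps(revs):
    """Fix out-of-order timestamps in one file's revision chain.

    `revs` is a list of (revision, timestamp) pairs ordered from the
    lowest revision to the highest.  Returns (fixed, adjustments):
    `fixed` is the corrected list, and `adjustments` maps each moved
    revision to its original timestamp so that a resync record can be
    written out later.
    """
    fixed = []
    adjustments = {}
    prev_time = None
    for rev, timestamp in revs:
        if prev_time is not None and timestamp < prev_time:
            adjustments[rev] = timestamp   # keep the original time
            timestamp = prev_time + 1      # "just after" the lower rev
        fixed.append((rev, timestamp))
        prev_time = timestamp
    return fixed, adjustments
```
| |
| Each entry in the returned adjustments corresponds to one line that |
| would end up in the resync file. |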
| |
| Next, the parser encounters the *real* revision data, which has the |
| log messages and file contents. For each revision, it invokes |
| CollectData.set_revision_info(), which writes a new line to |
| cvs2svn-data.revs, like this: |
| |
| 3dc32955 5afe9b4ba41843d8eb52ae7db47a43eaa9573254 C 1.2 * 0 0 foo/bar,v |
| |
| The fields are: |
| |
| 1. a fixed-width timestamp |
| 2. a digest of the log message + author |
| 3. the type of change ("C"hange, or "D"elete) |
| 4. the revision number |
| 5. the branch on which this commit happened, or "*" if not on a branch |
| 6. the number of tags rooted at this revision (followed by their |
| names, space-delimited) |
| 7. the number of branches rooted at this revision (followed by |
| their names, space-delimited) |
| 8. the path of the RCS file in the repository |
| |
| (Of course, in the above example, fields 6 and 7 are "0", so they have |
| no additional data.) |
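| |
| To make the field layout concrete, here is a small sketch that parses |
| one line of cvs2svn-data.revs. The function name parse_revs_line() |
| and the dictionary keys are hypothetical, and the timestamp is assumed |
| to be hexadecimal seconds since the epoch, which is what the example |
| line suggests. |
| |
```python
def parse_revs_line(line):
    """Split one revs line into its eight logical fields."""
    fields = line.rstrip("\n").split(" ")
    timestamp = int(fields[0], 16)       # assumed to be hex seconds
    digest, op, rev, branch = fields[1:5]
    pos = 5
    ntags = int(fields[pos])
    tags = fields[pos + 1:pos + 1 + ntags]
    pos += 1 + ntags
    nbranches = int(fields[pos])
    branches = fields[pos + 1:pos + 1 + nbranches]
    pos += 1 + nbranches
    # Everything left is the path, which may itself contain spaces.
    path = " ".join(fields[pos:])
    return {
        "timestamp": timestamp,
        "digest": digest,
        "op": op,                         # "C" or "D"
        "rev": rev,
        "branch": None if branch == "*" else branch,
        "tags": tags,
        "branches": branches,
        "path": path,
    }
```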
| |
| Also, for resync'd revisions, a line like this is written out to |
| 'cvs2svn-data.resync': |
| |
| 3d6c1329 18a215a05abea1c6c155dcc7283b88ae7ce23502 3d6c1328 |
| |
| The fields are: |
| |
| NEW_TIMESTAMP DIGEST OLD_TIMESTAMP |
| |
| (The resync file will be explained later.) |
| |
| That's it -- the RCS file is done. |
| |
| When every RCS file is done, Pass 1 is complete, and: |
| |
| - cvs2svn-data.revs contains a summary of every RCS file's |
| revisions. All the revisions for a given RCS file are grouped |
| together, but note that the groups are in no particular order. |
| In other words, you can't yet identify the commits from looking |
| at these lines; a multi-file commit will be scattered all over |
| the place. |
| |
| - cvs2svn-data.resync contains a small amount of resync data, in |
| no particular order. |
| |
| Pass 2: |
| ======= |
| |
| This is where the resync file is used. The goal of this pass is to |
| convert cvs2svn-data.revs to a new file, 'cvs2svn-data.c-revs' (clean |
| revs). It's the same as the original file, except for some resync'd |
| timestamps. |
| |
| First, read the whole resync file into a hash table that maps each |
| author+log digest to a list of lists. Each sublist represents one of |
| the timestamp adjustments from Pass 1, and looks like this: |
| |
| [old_time_lower, old_time_upper, new_time] |
| |
| The reason to map each digest to a list of sublists, instead of to one |
| list, is that sometimes you'll get the same digest for unrelated |
| commits (for example, the same author commits many times using the |
| empty log message, or a log message that just says "Doc tweaks."). So |
| each digest may need to "fan out" to cover multiple commits, but |
| without accidentally unifying those commits. |
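| |
| A sketch of building that hash table from the resync lines. The |
| function name load_resync() is hypothetical. Two things are assumed |
| rather than taken from the text above: that the initial range is |
| centered on the old timestamp and spans COMMIT_THRESHOLD seconds, and |
| that COMMIT_THRESHOLD is five minutes. |
| |
```python
from collections import defaultdict

COMMIT_THRESHOLD = 5 * 60  # seconds; an assumed value for this sketch


def load_resync(lines):
    """Map each digest to a list of [lower, upper, new_time] sublists."""
    table = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        # Line layout: NEW_TIMESTAMP DIGEST OLD_TIMESTAMP
        new_hex, digest, old_hex = line.split()
        new_time = int(new_hex, 16)
        old_time = int(old_hex, 16)
        table[digest].append([old_time - COMMIT_THRESHOLD // 2,
                              old_time + COMMIT_THRESHOLD // 2,
                              new_time])
    return table
```
| |
| Because each digest maps to a list of sublists, two unrelated commits |
| that share a digest simply produce two separate sublists. |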
| |
| Now we loop over cvs2svn-data.revs, writing each line out to |
| 'cvs2svn-data.c-revs'. Most lines are written out unchanged, but |
| those whose digest matches some resync entry, and which appear to be part of |
| the same commit as one of the sublists in that entry, get tweaked. |
| The tweak is to adjust the commit time of the line to the new_time, |
| which is taken from the resync hash and results from the adjustment |
| described in Pass 1. |
| |
| The way we figure out whether a given line needs to be tweaked is to |
| loop over all the sublists, seeing if this commit's original time |
| falls within the old<-->new time range for the current sublist. If it |
| does, we tweak the line before writing it out, and then conditionally |
| adjust the sublist's range to account for the timestamp we just |
| adjusted (since it could be an outlier). Note that this could, in |
| theory, result in separate commits being accidentally unified, since |
| we might gradually adjust the two sides of the range such that they are |
| eventually more than COMMIT_THRESHOLD seconds apart. However, this is |
| really a case of CVS not recording enough information to disambiguate |
| the commits; we'd know we have a time range that exceeds the |
| COMMIT_THRESHOLD, but we wouldn't necessarily know where to divide it |
| up. We could try some clever heuristic, but for now it's not |
| important -- after all, we're talking about commits that weren't |
| important enough to have a distinctive log message anyway, so does it |
| really matter if a couple of them accidentally get unified? Probably |
| not. |
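| |
| The matching and range adjustment can be sketched like this. The |
| function name resync_time() is hypothetical, the table is assumed to |
| map a digest to [lower, upper, new_time] sublists, and the way the |
| range is widened (half of an assumed five-minute COMMIT_THRESHOLD on |
| either side of the matched timestamp) is an assumption as well. |
| |
```python
COMMIT_THRESHOLD = 5 * 60  # seconds; an assumed value for this sketch


def resync_time(table, digest, timestamp):
    """Return the (possibly adjusted) commit time for one revs line.

    `table` maps a digest to a list of [lower, upper, new_time]
    sublists.  A matching sublist is widened in place so that later
    lines from the same commit still fall inside its range.
    """
    for window in table.get(digest, []):
        lower, upper, new_time = window
        if lower <= timestamp <= upper:
            # Stretch the range around the timestamp we just matched.
            window[0] = min(lower, timestamp - COMMIT_THRESHOLD // 2)
            window[1] = max(upper, timestamp + COMMIT_THRESHOLD // 2)
            return new_time
    return timestamp  # no resync entry applies; leave the line alone
```
| |
| Repeated matches near the edge of the range keep stretching it, which |
| is exactly the creep that could unify separate commits. |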
| |
| Pass 3: |
| ======= |
| |
| This is where we deduce the changesets, that is, the grouping of file |
| changes into single commits. |
| |
| It's very simple -- run 'sort' on cvs2svn-data.c-revs, converting it |
| to 'cvs2svn-data.s-revs'. Because of the way the data is laid out, |
| this causes commits with the same digest (that is, the same author and |
| log message) to be grouped together. Poof! We now have the CVS |
| changes grouped by logical commit. |
| |
| In some cases, the changes in a given commit may be interleaved with |
| other commits that went on at the same time, because the sort gives |
| precedence to date before log digest. However, Pass 4 detects this by |
| seeing that the log digest is different, and reseparates the commits. |
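| |
| A rough sketch of how sorting plus digest comparison yields commits. |
| It folds the Pass 3 sort and the Pass 4 separation into a single |
| hypothetical function, group_commits(), and assumes that a commit is |
| closed once more than COMMIT_THRESHOLD seconds (assumed to be five |
| minutes) pass without another line carrying its digest. |
| |
```python
COMMIT_THRESHOLD = 5 * 60  # seconds; an assumed value for this sketch


def group_commits(lines):
    """Group revs lines into commits by digest and time proximity."""
    open_commits = {}  # digest -> [last_timestamp, lines_so_far]
    finished = []
    # The fixed-width hex timestamp leads each line, so a plain
    # lexical sort is also a chronological sort.
    for line in sorted(lines):
        ts_hex, digest = line.split(" ", 2)[:2]
        timestamp = int(ts_hex, 16)
        # Close every open commit that has been idle for too long.
        for stale in [d for d, (last, _) in open_commits.items()
                      if timestamp - last > COMMIT_THRESHOLD]:
            finished.append(open_commits.pop(stale)[1])
        if digest in open_commits:
            open_commits[digest][0] = timestamp
            open_commits[digest][1].append(line)
        else:
            open_commits[digest] = [timestamp, [line]]
    finished.extend(entry[1] for entry in open_commits.values())
    return finished
```
| |
| Interleaved commits stay separate here because each digest has its own |
| open entry, even when their lines alternate in the sorted input. |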
| |
| Pass 4: |
| ======= |
| |
| Walk through cvs2svn-data.s-revs and print the commits to a Subversion |
| dumpfile (a file intended for 'svnadmin load'). The dumpfile is the |
| data's last static stage: last chance to check over the data, run it |
| through svndumpfilter, move the dumpfile to another machine, etc. |
| |
| =============================== |
| Branches and Tags Plan. |
| =============================== |
| |
| This pass is also where tag and branch creation is done. Since |
| Subversion does tags and branches by copying from existing revisions |
| (then maybe editing the copy, making subcopies underneath, etc), the |
| big question for cvs2svn is how to achieve the minimum number of |
| operations per creation. For example, if it's possible to get the |
| right tag by just copying revision 53, then it's better to do that |
| than, say, copying revision 51 and then sub-copying in bits of |
| revision 52 and 53. |
| |
| Also, since CVS does not version symbolic names, there is the |
| secondary question of *when* to create a particular tag or branch. |
| For example, a tag might have been made at any time after the youngest |
| commit included in it, or might even have been made piecemeal; and the |
| same is true for a branch, with the added constraint that for any |
| particular file, the branch must have been created before the first |
| commit on the branch. |
| |
| Answering the second question first: cvs2svn creates tags and branches |
| as late as possible. For branches, this is "just in time" creation -- |
| the moment it sees the first commit on a branch, it snaps the entire |
| branch into existence (or as much of it as possible), and then outputs |
| the branch commit. |
| |
| The reason we say "as much of it as possible" is that it's possible to |
| have a branch where some files have branch commits occurring earlier |
| than the other files even have the source revisions from which the |
| branch sprouts (this can happen if the branch was created piecemeal, |
| for example). In this case, we create as much of the branch as we |
| can, that is, as much of it as there are source revisions available to |
| copy, and leave the rest for later. "Later" might mean just until |
| other branch commits come in, or else during a cleanup stage that |
| happens at the end of this pass (about which more later). |
| |
| All tags are created during the cleanup stage, after all regular |
| commits have been made. That way there's no need to worry whether all |
| the required revisions for a particular tag have been committed yet, |
| and it's as correct as any other time, since no one can tell when a |
| tag was made anyway. |
| |
| How just-in-time branch creation works: |
| |
| In order to make the "best" set of copies/deletes when creating a |
| branch, cvs2svn keeps track of two sets of trees while it's making |
| commits: |
| |
| 1. A skeleton mirror of the Subversion repository, that is, an |
| array of revisions, with a tree hanging off each revision. (The |
| "array" is actually implemented as an anydbm database itself, |
| mapping string representations of numbers to root keys.) |
| |
| 2. A tree for each CVS symbolic name, and the svn file/directory |
| revisions from which various parts of that tree could be copied. |
| |
| Both tree sets live in anydbm databases, using the same basic schema: |
| unique keys map to marshal.dumps() representations of dictionaries, |
| which in turn map entry names to other unique keys: |
| |
| root_key ==> { entryname1 : entrykey1, entryname2 : entrykey2, ... } |
| entrykey1 ==> { entrynameX : entrykeyX, ... } |
| entrykey2 ==> { entrynameY : entrykeyY, ... } |
| entrykeyX ==> { etc, etc ...} |
| entrykeyY ==> { etc, etc ...} |
| |
| (The leaf nodes -- files -- are also dictionaries, for simplicity.) |
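| |
| The schema can be illustrated with a small sketch. The class name |
| KeyedTree and its methods are hypothetical, keys are generated from a |
| simple counter, and a plain dict stands in for the anydbm database so |
| that the example runs anywhere. |
| |
```python
import marshal


class KeyedTree:
    """A tree stored as key -> marshal.dumps({entry name: child key})."""

    def __init__(self, db=None):
        self.db = {} if db is None else db  # stand-in for a dbm file
        self._next = 0
        self.root_key = self._new_node()

    def _new_node(self):
        key = str(self._next)
        self._next += 1
        self.db[key] = marshal.dumps({})    # every node is a dictionary
        return key

    def _load(self, key):
        return marshal.loads(self.db[key])

    def add_path(self, path):
        """Create any missing nodes along `path`; return the leaf key."""
        key = self.root_key
        for name in path.strip("/").split("/"):
            node = self._load(key)
            if name not in node:
                node[name] = self._new_node()
                self.db[key] = marshal.dumps(node)
            key = node[name]
        return key

    def lookup(self, path):
        """Return the key for `path`, or None if it does not exist."""
        key = self.root_key
        for name in path.strip("/").split("/"):
            node = self._load(key)
            if name not in node:
                return None
            key = node[name]
        return key
```
| |
| Storing each directory as its own marshalled dictionary means that a |
| change to one entry rewrites only that one node. |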
| |
| Both file and directory dictionaries store metadata under special keys |
| whose names start with "/", so they can always be distinguished from |
| entries (for example, search for "/mutable", "/openings", or |
| "/closings" in cvs2svn.py). |
| |
| The repository mirror allows cvs2svn to remember what paths exist in |
| what revisions. For each file path in a revision, it records what |
| tags and branches can sprout from that revision; when the file |
| changes, these attributes do not propagate to the new revision, since |
| the symbolic name isn't based on that revision. |
| |
| The symbolic name trees are all stored in one db file, as paths, where |
| the first element in each path is the symbolic name, and the rest is |
| the full Subversion path to the file in question. For example, if the |
| Subversion revision 7 is the root of branch 'Rel_1', this fact would |
| be recorded under the path |
| |
| '/Rel_1/myproj/trunk/lib/driver.c' |
| |
| (the exact layout is dependent on the make_path() function in |
| cvs2svn.py, which may change). |
| |
| root_key ==> { 'Rel_1' : 'a', ... } |
| 'a' ==> { 'myproj' : 'b', ... } |
| 'b' ==> { 'trunk' : 'c', ... } |
| 'c' ==> { 'lib' : 'd', ... } |
| 'd' ==> { 'driver.c' : 'e', ... } |
| 'e' ==> { } |
| |
| The source revision is stored in the leaf node, and also in all the |
| parent nodes, in the manner described in the class documentation for |
| 'SymbolicNameTracker'. The special entries "/opening" and "/closing" |
| are not shown above, for brevity, but their values are where the |
| revision ranges are stored (that is, the ranges indicating when this |
| path could be copied from to produce the tag or branch in question). |
| |
| When it's time to create a branch or tag, cvs2svn.py walks the |
| appropriate symbolic name tree, calculating the ideal source revision |
| for each subpath (see 'SymbolicNameTracker' for the exact algorithm) |
| and emitting the minimum number of copies to the dumpfile and to the |
| skeleton repository mirror. As it goes, it marks each path as |
| emitted, so that we don't redo the same copies during the cleanup |
| phase later on. |
| |
| At this point, the entire branch is done except for: |
| |
| 1. Any source revisions that haven't yet been committed (this is |
| a rare situation, but anyway such revisions will automatically |
| be handled later by the same algorithm, invoked either due to |
| another commit on the branch, or in the cleanup phase), and |
| |
| 2. Files that were accidentally copied onto the branch as part of a |
| subtree, but which don't actually belong on the branch, because |
| the corresponding CVS file doesn't contain that tag. |
| |
| We handle (2) by doing tree diffs between the newly copied tree in the |
| skeleton repository mirror, and the corresponding portion of the |
| symbolic name tree. If the skeleton mirror has a file that's not in |
| the symbolic name tree, we emit a delete to the dumpfile and remove |
| that path from the skeleton mirror. |
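| |
| A minimal sketch of that tree diff, using nested in-memory |
| dictionaries instead of the database-backed trees. The function name |
| extra_paths() is hypothetical, and pruning a whole missing directory |
| with a single delete is a simplification assumed for this sketch. |
| |
```python
def extra_paths(mirror, symbolic, prefix=""):
    """Return paths that are in `mirror` but not in `symbolic`.

    Both arguments are nested dictionaries mapping entry names to
    subtrees.  Keys starting with "/" are metadata and are skipped.
    """
    deletes = []
    for name in sorted(mirror):
        if name.startswith("/"):
            continue                      # metadata, not an entry
        path = prefix + "/" + name
        if name not in symbolic:
            deletes.append(path)          # one delete covers the subtree
        else:
            deletes.extend(extra_paths(mirror[name], symbolic[name], path))
    return deletes
```
| |
| Each returned path would correspond to one delete in the dumpfile and |
| one removal from the skeleton mirror. |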
| |
| The cleanup phase happens after all regular changes have been |
| processed. Just loop over the "root directory" of the symbolic name |
| tree, running the same creation algorithm on each name (we'll have to |
| distinguish between branches and tags, probably through a special |
| entry on the directory object), skipping parts of the tree already |
| marked as copied. |
| |
| |
| Pass 5: |
| ======= |
| |
| Load the dumpfile into Subversion. Voilà. |
| |
| |
| |
| |
| -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- |
| |
| Some older notes and ideas about cvs2svn. Not deleted, because they |
| may contain suggestions for future improvements in design. |
| |
| ----------------------------------------------------------------------- |
| |
| An email from John Gardiner Myers <jgmyers@speakeasy.net> about some |
| considerations for the tool. |
| |
| ------ |
| From: John Gardiner Myers <jgmyers@speakeasy.net> |
| Subject: Thoughts on CVS to SVN conversion |
| To: gstein@lyra.org |
| Date: Sun, 15 Apr 2001 17:47:10 -0700 |
| |
| Some things you may want to consider for a CVS to SVN conversion utility: |
| |
| If converting a CVS repository to SVN takes days, it would be good for |
| the conversion utility to keep its progress state on disk. If the |
| conversion fails halfway through due to a network outage or power |
| failure, that would allow the conversion to be resumed where it left off |
| instead of having to start over from an empty SVN repository. |
| |
| It is a short step from there to allowing periodic updates of a |
| read-only SVN repository from a read/write CVS repository. This allows |
| the more relaxed conversion procedure: |
| |
| 1) Create SVN repository writable only by the conversion tool. |
| 2) Update SVN repository from CVS repository. |
| 3) Announce the time of CVS to SVN cutover. |
| 4) Repeat step (2) as needed. |
| 5) Disable commits to CVS repository, making it read-only. |
| 6) Repeat step (2). |
| 7) Enable commits to SVN repository. |
| 8) Wait for developers to move their workspaces to SVN. |
| 9) Decommission the CVS repository. |
| |
| You may forward this message or parts of it as you see fit. |
| ------ |
| |
| ----------------------------------------------------------------------- |
| |
| Further design thoughts from Greg Stein <gstein@lyra.org> |
| |
| * timestamp the beginning of the process. ignore any commits that |
| occur after that timestamp; otherwise, you could miss portions of a |
| commit (e.g. scan A; commit occurs to A and B; scan B; create SVN |
| revision for items in B; we missed A) |
| |
| * the above timestamp can also be used for John's "grab any updates |
| that were missed in the previous pass." |
| |
| * for each file processed, watch out for simultaneous commits. this |
| may cause a problem during the reading/scanning/parsing of the file, |
| or the parse succeeds but the results are garbaged. this could be |
| fixed with a CVS lock, but I'd prefer read-only access. |
| |
| algorithm: get the mtime before opening the file. if an error occurs |
| during reading, and the mtime has changed, then restart the file. if |
| the read is successful, but the mtime changed, then restart the |
| file. |
| |
| * use a separate log to track unique branches and non-branched forks |
| of revision history (Q: is it possible to create, say, 1.4.1.3 |
| without a "real" branch?). this log can then be used to create a |
| /branches/ directory in the SVN repository. |
| |
| Note: we want to determine some way to coalesce branches across |
| files. It can't be based on name, though, since the same branch name |
| could be used in multiple places, yet they are semantically |
| different branches. Given files R, S, and T with branch B, we can |
| tie those files' branch B into a "semantic group" whenever we see |
| commit groups on a branch touching multiple files. Files that |
| have a (named) branch but no commits on it are simply ignored. For |
| each "semantic group" of a branch, we'd create a branch based on |
| their common ancestor, then make the changes on the children as |
| necessary. For single-file commits to a branch, we could use |
| heuristics (pathname analysis) to add these to a group (and log what |
| we did), or we could put them in a "reject" kind of file for a human |
| to tell us what to do (the human would edit a config file of some |
| kind to instruct the converter). |
| |
| * if we have access to the CVSROOT/history, then we could process tags |
| properly. otherwise, we can only use heuristics or configuration |
| info to group up tags (branches can use commits; there are no |
| commits associated with tags) |
| |
| * ideally, we store every bit of data from the ,v files to enable a |
| complete restoration of the CVS repository. this could be done by |
| storing properties with CVS revision numbers and stuff (i.e. all |
| metadata not already embodied by SVN would go into properties) |
| |
| * how do we track the "states"? I presume "dead" is simply deleting |
| the entry from SVN. what are the other legal states, and do we need |
| to do anything with them? |
| |
| * where do we put the "description"? how about locks, access list, |
| keyword flags, etc. |
| |
| * note that using something like the SourceForge repository will be an |
| ideal test case. people *move* their repositories there, which means |
| that all kinds of stuff can be found in those repositories, from |
| wherever people used to run them, and under whatever development |
| policies may have been used. |
| |
| For example: I found one of the projects with a "permissions 644;" |
| line in the "gnuplot" repository. Most RCS releases issue warnings |
| about that (although they properly handle/skip the lines). |
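| |
| The mtime-based restart algorithm from the notes above (for coping |
| with simultaneous commits without taking a CVS lock) could look |
| roughly like this. The function name read_stable() and the |
| max_attempts limit are hypothetical additions for this sketch. |
| |
```python
import os


def read_stable(path, max_attempts=5):
    """Read `path`, restarting if it changes while being read."""
    for _ in range(max_attempts):
        before = os.stat(path).st_mtime
        try:
            with open(path, "rb") as f:
                data = f.read()
        except OSError:
            if os.stat(path).st_mtime != before:
                continue          # file changed under us: restart
            raise                 # a genuine error, not a racing commit
        if os.stat(path).st_mtime == before:
            return data           # mtime unchanged: the read is trusted
        # Read succeeded but the mtime moved: restart the file.
    raise RuntimeError("%s kept changing while being read" % path)
```
| |
| This needs only read access to the repository; the attempt limit just |
| keeps a very busy file from stalling the scan forever. |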