| |
| The Tradeoff |
| ============ |
| |
| CVS indexes data in a certain way. When you create a CVS tag, a label |
| must be applied to every single file in the repository. It takes O(N) |
| time, where N is the size of the tree you're tagging. The tradeoff, |
| however, is that someone can look at a specific version of a file, and |
| see *all* the tag-labels attached to it. |
| |
| Subversion's repository has the data indexed in the other direction. |
| When you create an SVN tag, it makes a single new directory node that |
| points to an existing tree. It takes O(1) (constant) time. The |
| tradeoff, however, is that a version of a file is 'shared' by any |
| number of directory paths. That means it takes O(N) time to find |
| every tag that contains the specific file. |
| |
| |
| Why? |
| === |
| |
| Why does Subversion index data this way? There are a few reasons the |
| designers chose to do this. Having branches and tags in normal |
| directory space makes it easy to browse them, easy to do access |
| control on them, and (of course) they're automatically versioned. |
| |
| Also, the designers thought this would be optimizing the right |
| operations. Organizations tend to create branches and tags quite |
| frequently -- much more frequently than asking the question "which |
| tags contain a specific file?" So if you can only make one of these |
| operations O(1), you want it to be tagging. (Of course, if we knew a |
| way to make them *both* cheap, that would be the best solution! But |
| we haven't found a way yet.) |
| |
| |
| |
| Questions that Users Ask |
| ======================== |
| |
| Here are some questions subversion users might ask, and how subversion |
| deals with each question. |
| |
| |
| 1. "What version of foo.c is in tag X?" |
| |
| This is the easiest question to answer. Go into the tag-tree, and |
| look at the version of foo.c it contains. |
| |
| (This can be done with a simple "svn ls -v URL", where URL is a path |
| to the specific tag directory. Look at the first column of numbers.) |
| |
| |
| |
| 2. "Does tag X contain the latest version of foo.c?" |
| |
| This is a bit harder to answer. From question 1, it's easy to see |
| that tag X contains version N of foo.c. But how do we know if that's |
| the *latest* foo.c? |
| |
| Running 'svn log' on version N of foo.c won't help, because it only |
| goes backwards in time. That is, it only shows predecessor nodes, not |
| successor nodes. |
| |
| Subversion-1.0 uses BerkeleyDB. The only reason 'svn log' shows |
| predecessor nodes easily is because each node contains a back-pointer |
| to its predecessor. It would be extremely painful to search BDB for |
| successors; BDB is mostly a glorified hashtable with transactions. |
| |
| [Note from kfogel: Is there some reason we can't store successor |
| nodes at commit time? That is if N is a director successor of M, |
| then when we create N, we add it to M's "successors list". Then we |
| could track forward as well as backward... Nothing against having |
| an SQL backend someday, of course, just pointing out that this |
| particular problem can be solved simply in Berkeley DB.] |
| |
| Post-1.0 Subversion, however, will be able to use a SQL backend, and |
| then it will be very quick and easy to query for node successors. At |
| that point, Subversion could make nice "complete history" graphs of |
| nodes, just like Clearcase does. |
| |
| |
| |
| 3. "Which tags contain version N of foo.c?" |
| |
| This is the killer question, and the crux of the Tradeoff mentioned at |
| the beginning of this document. Because Subversion has O(1) tagging, |
| the only way to answer this question is by brute-force searching. |
| |
| But there are two consolations to this tradeoff: |
| |
| A) "Rethink your work habits" |
| |
| From experience, when users ask question #3, it can very often |
| be rephrased as a question about a *specific* tag. Very often, |
| the manager doesn't really want to see the exhaustive list of |
| every tag containing the file; instead, they simply want to know |
| if a *certain* tag has the file. ("Did we give that file to a |
| particular customer?") It turns into a "type-1" question. |
| |
| If you're used to CVS, it's very easy to instantly get the list |
| of all tags attached to a file-version. And therefore you |
| habituate to that, and use the tags-list as your main means of |
| answering all your type-1 questions. But it's certainly not |
| *required* to answer type-1 questions. |
| |
| B) "Build a cache" |
| |
| If a brute-force search is ever performed, it shouldn't be too |
| difficult to cache the results of the search, because repository |
| trees are immutable. That means the next time somebody runs the |
| search, the search becomes *much* smaller. Eventually, the |
| search can dwindle down into what feels like O(1) time, at least |
| when viewed from a distance. :-) |
| |