| -*- Text -*- |
| |
| Content |
| ======= |
| |
| * Context |
| * Requirements |
| * Would-be-very-nice-to-have's |
| * Non-goals |
| * Open items / discussion points |
| * Problems in wc-1.0 |
| * Possible solutions |
| - Developer sanity |
| - Speed |
| - Cross node type change representation |
| - Flexibility of metadata storage |
| - Transaction duration / memory management |
| - Working copy stability |
| - Transactional updates |
| - Work queue |
| * Prerequisites for a good wc implementation |
| * Modularization |
| * Implementation proposals for |
| - Mapping of svn_wc_entry_t fields to BASE/WORKING |
| - Basic storage mechanics |
| - Metadata schemas |
| - Commit process |
| - Random notes |
| - Code organization |
| - svn_wc.h API |
| * Upgrading old working copies |
| * Implementation plan |
| |
| |
| Context |
| ======= |
| |
| The working copy library has traditionally been a complex piece of |
| machinery and libsvn_wc-1.0 (wc-1.0 hereafter) was more a result of |
| evolution than it was a result of design. This is less anybody's |
| fault than a consequence of the developers at the time being unaware |
| of the problem(s) inherent to versioning trees instead of files (as |
| was the usual context within CVS). As a result, the WC |
| has been one of the most fragile areas of the Subversion versioning |
| model. |
| |
| The wc is where a large number of issues come together which can |
| be considered separate issues in the remainder of the system, or |
| don't have any effect on the rest of the system at all. The following |
| things come to mind: |
| |
| * Different behaviours required by different use-cases (users) |
| For example: some users want mtime's at checkout time |
| to be the checkout time, some want it to be the historical |
| value at check-in time (and others want different variants). |
| * Different filesystems behave differently, yet Subversion |
| is a cross platform tool and tries to behave the same on all |
| filesystems (timestamp resolution may be an example of this). |
| |
| When considering the wc-1.0 design, one finds that there are a lot of |
| situations where the state of the versioned tree is poorly defined. |
| To clarify the tree state, the wc-ng design splits it into several |
| pieces. A versioned item in the working copy is described by one or |
| more nodes in the following schema. |
| |
| * BASE: Nodes describing repository data in the WC. |
| |
| Each BASE node corresponds to a particular repository URL and |
| revision. Mixed-revision working copies are still common. |
| Pristine file content is kept in the content store. |
| |
| * WORKING: Nodes describing structural changes to BASE. |
| |
| An item in the WC may have a BASE node, a WORKING node, or both. |
| The WORKING-only case arises when an item is added, copied-here, |
| or moved-here on a path that doesn't exist in BASE. |
| |
| * NODE_DATA: Nodes describing layered structural changes to WORKING. |
| |
| An example: suppose a directory is replaced, a directory is |
| copied into it, and a file is added to the subdirectory. BASE |
| and WORKING describe the replacement, NODE_DATA describes the |
| copied subdirectory, and NODE_DATA (again) describes the added |
| file. |
| |
| * ACTUAL: Nodes describing content modifications and annotations. |
| |
| An ACTUAL node contains one or more of the following: |
| |
| - Text-modified flag (for a file) |
| - Properties |
| - Changelist name |
| - Conflict metadata |
| |
| The text-modified flag may be out of sync with the file's real |
| status. The other data is set directly by Subversion operations. |
| |
| Some tree conflicts lead to ACTUAL nodes that don't exist in the |
| working copy (e.g., when a merge tries to edit a file that |
| doesn't exist). These have no corresponding BASE, WORKING or |
| NODE_DATA nodes. |
| |
| ##BHB: For tree, text and property conflicts it would be nice to handle |
| MINE, THEIRS (and OLDER?) as semi-trees too. See this mail thread |
| http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=984982 |
| This would provide a clean way to access all information about the conflict |
| origins: $ svn info file@THEIRS |
| |
| |
| Requirements |
| ============ |
| |
| * Developer sanity |
| From this requirement, a number of additional ones follow: |
| - Very explicit tree state management; clear difference between |
| each of the 5 states we may be looking at |
| - It must be "fun" to code wc-ng enhancements |
| * Speed |
| (Note: a trade off may be required for 'checkout' vs 'status' speed) |
| * Cross-node-type working copy changes |
| * Flexibility |
| The model should make it easy to support |
| - central vs local metadata storage |
| - Last modified timestamp behaviours |
| - .svn-less working copy subtrees |
| - different file-changed detection schemes |
| (e.g. full tree scan as in wc-1.0 as well as 'p4 edit') |
| * Graceful (defined) fallback for non-supported operations |
| When a checkout tries to create a symlink on an OS which supports |
| them, on a filesystem which doesn't, we should cope without |
| canceling the complete checkout. Same for marking metadata read-only. |
| * Gracefully handle symlinks in relation to any special-handling of |
| files (don't special-handle symlinks!) |
| * Clear/reparable tree state |
| Unlike our current loggy system, what I mean here is: "there is a command |
| by which the user can restart the command he/she last issued and |
| Subversion will help complete that command", which differs from our |
| loggy system in the way that it will return the working copy to a |
| defined (but to the user unknown) state. |
| * Transactional/ repairable tree state (with which I mean something |
| which achieves the same as our loggy system, but better). |
| * Case sensitive filesystem aware / resilient |
| * Working copy stability; a number of scenarios with switch and |
| update obstructions used to leave the working copy unrecoverable |
| * Client side 'true renames' support where one side can't be committed |
| without the other (relates to issue #876) |
| |
| ###JSS: Perhaps this is obvious... I think that requirement is fine for the |
| user doing the commit. We still need to remember that another user doing |
| the update may not have authz permission to the directory it was renamed |
| into or may have a checkout of a sub-tree and that target directory may |
| not exist. Likewise, the original location might be unavailable too. |
| |
| * Change detection should become entirely internal to libsvn_wc (referring |
| to the fact that libsvn_client currently calls svn_wait_for_timestamps()), |
| even though under 'use-commit-times=yes', this waiting is |
| completely useless. |
| * Last-modified recording as a preparation for solving issue #1256 and |
| as defined in this mail, also linked from the issue: |
| http://svn.haxx.se/dev/archive-2006-10/0193.shtml |
| * Representing "this node is part of a replaced-with-history tree and |
| I'm *not* in the replacement tree" as well as "... and I'm deleted |
| from the replacement tree" [issues #1962 and #2690] |
| |
| |
| Would-be-very-nice-to-have's |
| ============================ |
| |
| * Multiple users with a single working copy (aka shared working copy) |
| * Ending up with an implementation which can use current WCs |
| (without conversion) |
| * Working copies/ metadata storages without local storage of text-bases |
| (other than a few cached ones) |
| |
| |
| Non-goals |
| ========= |
| |
| * Off-line commits |
| * Distributed VC |
| |
| Open items / discussion points |
| ============================== |
| |
| * Files changed during the window "sent as part of commit" to |
| "post commit wc processing"; these are currently explicitly |
| supported. Do we want to keep this support (at the cost of speed)? |
| * Single working copy lock. Should we have one lock which locks the |
| entire working copy, disabling any parallel actions on disjoint |
| parts of the working copy? |
| * Meta data physical read-only marking (as in wc-1.0). Is it still |
| required, or should it become advisory (ie ignore errors on failure)? |
| * Is issue #1599 a real use-case we need to address? |
| (Losing and regaining authz access with updates in between) |
| |
| |
| Problems in wc-1.0 |
| ================== |
| |
| * There's no way to clear unused parts of the entries cache |
| * The code is littered with path calculations in order |
| to access different parts of the working copy (incl. admin areas) |
| * The code is littered with direct accesses to both wc files and |
| admin area files |
| * It's not always clear at which time log files are being processed |
| (ie transactions are being committed), meaning it's not always |
| clear which version of a tree one is looking at: the pre or post |
| transformation versions... |
| * There's no support for nested transactions (even though some |
| functions want to start a new transaction, regardless whether one |
| was already started) |
| * It's very hard to determine when an action needs to be written |
| to a transaction or needs to be executed directly |
| * All code assumes local access to admin (meta)data |
| * The transaction system contains non-runnable commands |
| * It's possible to generate combinations of commands, each of which |
| is runnable, but the series isn't |
| * Long if() blocks to sort through all possible states of |
| WORKING, ACTUAL and BASE, without calling it that. |
| * Large if() blocks dealing with the difference between file and |
| directory nodes |
| * Many special-handling if()s for svn:special files |
| * Manipulation of paths, URLs and base-text paths in 1 function |
| * 'Switchedness' of subdirectories has to be derived from the |
| URLs of the parent and the child, but copied nodes also have |
| non-parent-child source URLs... (confusing) |
| * Duplication of data: a 'copied' boolean and a 'copy_source' URL field |
| * Checkouts fail when checking out files of different casing to a case |
| insensitive filesystem |
| * Checkouts fail when marking working copy admin data as read-only |
| is a non-supported FS operation (VFAT or Samba mounts on Linux have |
| this behaviour) |
| * Obstructed updates leave operations half done; in case of a switch, |
| it's not always possible to switch back (because the switch itself |
| may have left now-unversioned items behind) |
| * Directories which have their own children merged into them (which happens |
| when merging a directory-add) won't correctly fold the children into |
| schedule==normal, but instead leave them as schedule==add, resulting in |
| a double commit (through HTTP, other RA layers fold the double add, but |
| that's not the point) [see issue #1962] |
| * transaction files (ie log files) are XML files, requiring correct |
| encoding of characters and other values; given the short expected |
| life-time of a log file and the fact that we're almost completely sure |
| the log file is going to be read by the WC library anyway (no interchange |
| problems), this is a waste of processing time |
| * No strict separation between public and internal APIs: many public |
| APIs also used internally, growing arguments which *should* only |
| matter for internal use |
| * The lock strategy requires writing a file in every directory of a working copy, |
| which severely reduces our performance in several environments. (Windows, |
| NFS). Testing showed that in some cases we used more than 50 seconds on |
| writing 8000 lockfiles before we even started looking what to update. A new |
| lock strategy should reduce the number of writes necessary for locking with |
| depth infinity. |
| |
| Possible solutions |
| ================== |
| |
| Developer sanity |
| ---------------- |
| Strict separation between modules should help keep code focused at one |
| task. Probably some of the required user-specific behaviours can (and |
| should) be hidden behind vtables; for example: setting the file stamp |
| to the commit time, last recorded time or leaving it at the current time |
| should be abstracted from. |
| |
| Access to 'text bases' is another one of these areas: most routines in |
| wc-1.0 don't actually need access to a file (a stream would be fine as |
| well), but since the files are there, availability is assumed. |
| When abstracting all access into streams, the actual administration of |
| the BASE tree can be abstracted from: for all we know the 'tree storage |
| module' may be reading the stream directly off the repository server. |
| [The only module in wc-1.0 which *requires* access to the files is |
| the diff/merge library, because it rewinds to the start of the file |
| during its processing; an operation not supported by streams... and even |
| then, if these routines are passed file handles, they'll be quite |
| happy, meaning they still don't need to know where the text base / |
| source file is...] |
| |
| ###GJS: the APIs should use streams so that we can decompress as the |
| stream is being read. the diff library will need a callback of some |
| kind to perform the rewind, which will effectively just close and |
| reopen the stream. if it rewinds *multiple* times, then we may want |
| to cache the decompressed version of the file. I'll |
| investigate. Given our metadata/base-text storage system, I suspect |
| it will be very easy to cache decompressed copies for a while. |
| |
| ###GJS: a very reasonable strategy is: non-binary files are compressed |
| by default. binaries are stored uncompressed. |
| future improvement: extension-based choices, or some other control |
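| |
| A minimal sketch of the "rewind by close and reopen" idea above, |
| written in Python for brevity rather than in the library's C. The |
| class name, the zlib container and the CACHE_AFTER_REWINDS threshold |
| are assumptions made for illustration, not part of any existing API. |

```python
import io
import zlib


class RewindableStream:
    """Read-only stream over a zlib-compressed text base.

    rewind() discards the decompressor and starts over from the first
    compressed byte. Once the stream has been rewound often enough, the
    decompressed content is cached and served from memory instead.
    """

    CACHE_AFTER_REWINDS = 2  # assumption: a real value would come from profiling

    def __init__(self, compressed):
        self._compressed = compressed
        self._rewinds = 0
        self._cached = None
        self._open()

    def _open(self):
        if self._cached is not None:
            self._plain = io.BytesIO(self._cached)
            return
        self._raw = io.BytesIO(self._compressed)
        self._inflater = zlib.decompressobj()
        self._pending = b""

    def read(self, size):
        if self._cached is not None:
            return self._plain.read(size)
        # Inflate just enough input to satisfy this read.
        while len(self._pending) < size:
            chunk = self._raw.read(4096)
            if not chunk:
                self._pending += self._inflater.flush()
                break
            self._pending += self._inflater.decompress(chunk)
        data, self._pending = self._pending[:size], self._pending[size:]
        return data

    def rewind(self):
        self._rewinds += 1
        if self._cached is None and self._rewinds >= self.CACHE_AFTER_REWINDS:
            self._cached = zlib.decompress(self._compressed)
        self._open()
```

| A caller such as the diff library only ever sees read() and rewind(); |
| whether the bytes come from disk, a cache or the server stays hidden. |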
| |
| In order to keep developers sane, it should be extremely clear at any |
| one time - when operating on a tree - which tree is being operated upon. |
| |
| One way to prevent the lengthy 'if()' blocks currently in wc-1.0, would be |
| to design a dispatch mechanism based on the path-state in WORKING/BASE and the |
| required transformation, dispatching to (small) functions which perform |
| solely that specific task. |
| #####XBC Do please note that this suggests yet another instance of |
| pure polymorphism coded in C. This runs contrary to the |
| developer sanity requirement. |
| ###GJS: agreed with XBC. |
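| |
| For reference, the dispatch idea under discussion can be sketched as |
| a lookup table. This is an illustration in Python, not a proposal for |
| the C code; the Op values, handler names and the choice of key |
| (has BASE node, has WORKING node, operation) are all hypothetical. |

```python
from enum import Enum


class Op(Enum):
    ADD = "add"
    DELETE = "delete"


# Each handler performs exactly one transformation for one path state.
def schedule_add(path):
    return f"add {path}"


def schedule_delete(path):
    return f"delete {path}"


def undo_add(path):
    return f"undo add of {path}"


# Key: (has BASE node, has WORKING node, requested operation).
HANDLERS = {
    (False, False, Op.ADD): schedule_add,      # unversioned path
    (True, False, Op.DELETE): schedule_delete, # unmodified checked-out node
    (False, True, Op.DELETE): undo_add,        # locally added node
}


def dispatch(has_base, has_working, op, path):
    """Pick the single small function for this state and operation."""
    handler = HANDLERS.get((has_base, has_working, op))
    if handler is None:
        raise ValueError(
            f"no handler for base={has_base} working={has_working} op={op.value}")
    return handler(path)
```

| Unsupported combinations fail loudly instead of falling through a |
| long if() chain, which is the property the proposal is after. |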
| |
| |
| Speed |
| ----- |
| wc-1.0 assumes the WORKING tree and the ACTUAL tree match, but then |
| goes out of its way to assure they actually do when deemed important. |
| The result is a library which calls stat() a lot more often than need be. |
| |
| One of the possible improvements would be to make wc-ng read all of |
| the ACTUAL state (concentrated in one place, using apr_stat()), keeping |
| it around as long as required, matching it with the WORKING state before |
| operating on either (not only when deemed important!). |
| |
| ###GJS: working copy file counts are unbounded, so we need to be |
| careful about keeping "all" stat results in memory. I'll certainly |
| keep this in mind, however. |
| |
| Working from the ACTUAL tree will also prove to be a step toward clarity |
| regarding the exact tree which is being operated upon. |
| |
| [This suggestion from wc-improvements also applies to wc-ng:] |
| Most operations are I/O bound and have CPU to spare. Consider the virtue |
| of compressed text bases in order to reduce the amount of I/O required. |
| |
| Another idea to reduce I/O is to eliminate atomic-rename-into-place for |
| the metadata part of the working copy: if a file is completely written, |
| store the name of the base-text/prop-text in the entries file, which gets |
| rewritten on most wc-transformations anyway. |
| |
| ###GJS: actually, I believe we *rarely* do full walks over the |
| filesystem multiple times. I doubt we will need to cache stat() |
| information. I think performance will primarily derive from |
| omitting the .svn/ location/walks and opening of multiple files |
| therein, in favor of a single SQLite database open. |
| |
| of course, we'll analyze the situation, but I suspect that we will |
| be in great shape as a natural fallout of our new storage system. |
| |
| |
| Cross node type change representation |
| ------------------------------------- |
| ###GJS: this is not allowed in wc-1, but should be easily possible in |
| wc-ng. the WORKING tree's node kind is different than the BASE |
| tree. no big deal. |
| |
| |
| Flexibility of metadata storage |
| ------------------------------- |
| There are 3 known models for storing metadata as requested by different |
| groups of users: |
| |
| - in-subtree metadata storage (.svn subdir model, as in wc-1.0) |
| ###GJS: euh... aren't we axing this? who has *requested* this? |
| - in-'tree root' metadata storage (working copy central) |
| - detached metadata storage (user-central) |
| - in $HOME/.subversion/ |
| - in arbitrary location (e.g. $HOME is a (slow) NFS mount, and we |
| want the metadata on a local drive, such as /var/...) |
| |
| A solution to implementing each of these behaviours in order to satisfy |
| the wide range of use-cases they solve, would be to define a module |
| interface and implement this interface three times (possibly using vtables). |
| |
| Note that using within-module vtables should be less problematic than our |
| post-1.0 experiences with public vtables (such as the ra-layer vtable): |
| implementation details are allowed to differ between releases (even patch |
| releases). |
| |
| ###GJS: note that we are talking about both metadata AND base-text |
| content. (and yeah, optional and compressed base-texts can be done |
| during this rewrite) Also note that we might be able to share |
| base-text content across working copies if they are all keyed by |
| the MD5 hash into storage directories (under the user-central model) |
| |
| ###GJS: I don't think vtables are needed here. This is simply altering |
| the base location, not a whole new implementation. My plan is to |
| default to the "tree root" model with a .svn subdirectory. If a |
| .svn subdir is not found, then we fall back to looking in the |
| $HOME/.subversion/ directory (some subdir under there). If we |
| *still* don't find it, then some config options will point us to |
| the metadata/base-text location. |
| |
| ###GJS: my plan is to upgrade the working copy if we find a pre-1.7 |
| working copy. all the data will be lifted from the multiple .svn |
| subdirectories, and relocated to the "proper" storage location. |
| This will be a non-reversible upgrade, and will preclude pre-1.7 |
| clients from using that working copy again. Note: because of the |
| "destructive" nature of this upgrade, and the expected duration, we |
| will require the user to perform an explicit action ('svn upgrade') |
| in order to complete the upgrade. However, 1.7 will not be able to |
| *modify* wc-1.0 metadata -- just read it in order to upgrade it to |
| the new storage system. |
| |
| When svn detects an old working copy, then it will error out and |
| request that the user run "svn upgrade" to upgrade their working copy |
| to the new format. |
| |
| The metadata location is determined at one of two points: |
| |
| * checkout time |
| * upgrade time |
| |
| According to the user's config, the metadata will be placed in one of |
| three areas: |
| |
| wcroot: at the root of the working copy in a .svn subdirectory |
| home: in the .subversion/wc/ subdirectory |
| /some/path: stored in the given path |
| |
| All wcroot directories will have a .svn subdirectory. In that |
| directory will be the datastore, or there will be a file that provides |
| two pieces of information: |
| |
| * absolute path to the (centralized) metadata |
| * absolute path of where this wcroot was created |
| |
| With this information, we can link a wcroot to its metadata in the |
| centralized store. If the user has moved the wcroot (the stored path |
| is different from the current/actual path), then Subversion will exit |
| with an error. The user must then ###somehow tell svn that the wc has |
| been copied (duplicate the metadata for the wcroot) or moved (tweak |
| the path stored in the metadata and in the linkage file). Subversion |
| is unable to programmatically determine which operation was used. |
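| |
| The lookup described above can be sketched as follows. The file |
| names "wc.db" and "metadata-link", and the two-line layout of the |
| linkage file, are assumptions made for this example; the text only |
| says which two pieces of information the file provides. |

```python
import os

DATASTORE_NAME = "wc.db"     # hypothetical name of the in-wcroot datastore
LINK_NAME = "metadata-link"  # hypothetical name of the linkage file


class WcMovedError(Exception):
    """The wcroot is no longer at the path recorded in the linkage file."""


def locate_metadata(wcroot):
    """Return the path of the metadata store for the given wcroot."""
    admin = os.path.join(wcroot, ".svn")
    local = os.path.join(admin, DATASTORE_NAME)
    if os.path.exists(local):
        return local  # "wcroot" model: the datastore lives in .svn itself

    # Centralized model. Assumed layout of the linkage file:
    #   line 1: absolute path to the (centralized) metadata
    #   line 2: absolute path of where this wcroot was created
    with open(os.path.join(admin, LINK_NAME)) as link:
        metadata_path, created_at = link.read().splitlines()[:2]

    if os.path.realpath(created_at) != os.path.realpath(wcroot):
        # Copy or move? That cannot be determined here, so the user
        # has to say which one it was.
        raise WcMovedError(f"{wcroot} was created at {created_at}")
    return metadata_path
```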
| |
| Note that we use "svn upgrade" as the trigger to *perform* the upgrade. |
| The number of file opens, parsing, moving, deleting, etc is expected |
| to consume significant amounts of I/O and (thus) cannot simply be done |
| on-the-fly without the user's knowledge and consent. |
| |
| |
| Transaction duration / memory management |
| ---------------------------------------- |
| The current pool-based memory management system is very good at managing |
| memory in a transaction-based processing model. In the wc library, a |
| 'transaction' often spans more than one call into the library. We either |
| need a sane way to handle this kind of situation using pools, or we may |
| need a different memory management strategy in wc-ng. |
| |
| Update (2009-05-10): pool-based management is still being used. we are |
| switching to a "dual pool" system that clarifies the intent of the |
| pools. the result_pool is used for return allocations, and |
| scratch_pool for any temporary allocations. |
| |
| |
| Working copy stability |
| ---------------------- |
| In light of obstructed updates it may not always be desirable to be able |
| to resume the current operation (as currently is the case): in some cases |
| the user may want to abort the operation, in other cases the user may |
| want to resolve the obstruction before re-executing the operation. |
| |
| The solution to this problem could be 'atomic updates': receiving the |
| full working copy transformation, verifying prerequisites, creating |
| replacement files and directories and when all that succeeds, update |
| the working copy. |
| |
| Full working copy unit tests: |
| Exactly because the working copy is such an important part of the |
| Subversion experience *and* because of the 'reputation' of wc-1.0, |
| we need a way to ensure wc-ng completely performs according to our |
| expectations. *The* way to ensure we're able to test the most contrived |
| edge-cases is to develop a full unit testing test-suite while developing |
| wc-ng. This will both be a measure to ensure working copy stability |
| as well as developer sanity: in the early stages of the wc-ng |
| development process, we'll be able to assess how well the design |
| holds up under more difficult 'weather'. |
| |
| ###GJS: agreed. as much as possible, when I (re)implement the old APIs |
| in terms of the new APIs, then I'll write a whitebox test. we'll |
| see how long I keep that up :-P |
| |
| Update (2009-05-10): wc-ng currently passes the entire test |
| suite. Additional tests have been implemented ('entries_tests.py' |
| and others) to try and ensure this continued compatibility. |
| |
| |
| Transactional updates |
| --------------------- |
| |
| .. where 'update' is meant as 'user command', not 'svn update' per se. |
| |
| When applied to files, this can be summarized as: |
| |
| * Receive transformations (update, delete, add) from |
| the server, |
| |
| |
| Work Queue |
| ---------- |
| |
| Certain operations that affect the filesystem require a stateful |
| marker that the operation needs to happen. The best example is when a |
| merge conflict occurs: several "pristine" files need to be placed into |
| the working copy (e.g. somefile.c.r34). Should processing fail after |
| the first file is placed, then we need to "remember" to resume the |
| operation and place the rest of the files. |
| |
| The record of these needed operations will be placed into a "work |
| queue" which is a table recorded in the SQLite database. Much like the |
| original loggy, a working copy will be unusable until these actions |
| are run to completion. |
| |
| Each work item must have the following properties: |
| |
| * order-independent. the work items must be allowed to execute in any |
| sequence. |
| * idempotent. the work item must be able to run an arbitrary number of |
| times. |
| * resumable. whether a previous run completed, or was only partially |
| completed, the work item must be able to complete its operation. |
| * independent. each work item must affect only one node in the logical |
| trees. it can apply to any/all of BASE/WORKING/ACTUAL, but it must |
| apply to a single logical node. |
| * complete. a work item must represent a complete operation which |
| takes the WC from one stable state to another. thus, a work item |
| cannot be used to "return the wc to a stable state" (the operation |
| that made it unstable should be included in the work item). |
| |
| The goal here is to reduce interactions across work items. Each must |
| be completely self-sufficient and resumable. |
| |
| The wc_db API will provide a low-level framework for adding, fetching, |
| and completing these work items. Each work item will be described by a |
| skel, to be interpreted by higher levels. |
| |
| ### the "independent" requirement is subject to discussion. It may be |
| possible to have a work item that touches multiple nodes. As long |
| as it can definitely place those nodes into a specific state, then |
| it might be okay to operate on many. |
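| |
| A minimal sketch of such a work queue, using Python's sqlite3 module. |
| The table layout and function names are assumptions, and JSON stands |
| in for the skel encoding. Items are fetched by id only to keep the |
| example deterministic; the properties above allow any order. |

```python
import json
import sqlite3


def open_queue(db_path=":memory:"):
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS work_queue ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " work TEXT NOT NULL)")
    return db


def add_item(db, item):
    """Record a work item; the insert is committed before returning."""
    with db:
        db.execute("INSERT INTO work_queue (work) VALUES (?)",
                   (json.dumps(item),))


def wc_is_usable(db):
    """The working copy is usable only when no work items remain."""
    (count,) = db.execute("SELECT COUNT(*) FROM work_queue").fetchone()
    return count == 0


def run_queue(db, handlers):
    """Run every item to completion.

    An item is deleted only after its handler returns, so a crash in
    the middle leaves it queued and it is simply run again next time.
    That is why handlers must be idempotent and resumable.
    """
    while True:
        row = db.execute(
            "SELECT id, work FROM work_queue ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return
        item_id, work = row
        item = json.loads(work)
        handlers[item["op"]](item)
        with db:
            db.execute("DELETE FROM work_queue WHERE id = ?", (item_id,))
```

| In the merge-conflict example, each pristine file to be placed would |
| be one queued item, and an interrupted run resumes with whatever is |
| still in the table. |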
| |
| |
| Prerequisites for a good wc implementation |
| ========================================== |
| |
| These prerequisites are to be addressed, either as definitions |
| in this document, or elsewhere in the subversion (source) tree: |
| * Well defined behaviour for cross-node type updates/merges/.. |
| (tree conflicts in particular) |
| * Well defined behaviour for special file handling |
| * Well defined behaviour for operations on locally missing items |
| (see issue #1082) |
| * Well defined change detection scheme for each of the different |
| last-modified handling strategies |
| * No special handling of symlinks: they are first class versioned objects |
| * Well defined behaviour for property changes on updates/merges/... |
| (this is a problem which may resemble tree conflicts!), |
| including 'svn:' special properties |
| * File name manipulation routines (availability) |
| * File name comparison routines (!) (availability; which compensate |
| for the different ways Unicode characters can be represented |
| [re: NFC/NFD Unicode issue]) |
| |
| ###JSS: Talking with ehu on IRC when I asked him about how to handle this |
| issue: "if we accept that some repositories will be unusable with wc-ng, |
| then we can standardize anything that comes in from the server as well as |
| the directory side into the same encoding. we'd be writing files with the |
| standardized encoding." The rest of this conversation centered around the |
| fact that either APR or the OS will convert the filename to the correct |
| form for the filesystem when doing the stat() call. Note, ehu says: "(we'll |
| need to retain the filename we got from the server though: we'll need it to |
| describe the file through the editor interface: the server still allows all |
| encodings.)" |
| |
| * URL manipulation routines (availability) |
| * URL comparison routines (availability; which compensate for |
| different ways the same URL can be encoded; see issue #2490) |
| * Modularization |
| * Agree on a UI to pull in other parts of the same repository |
| (NOT svn:externals) [relates to issue #1167] |
| #####XBC I submit this is a server-side feature that the client |
| (i.e. the WC library) should not know about. |
| * Agree on behaviour for update on moved items (relates to issue #1736) |
| * Case-sensitivity detection code to probe working copy filesystem |
| |
| |
| Implementation proposals |
| ======================== |
| |
| Classification of svn_wc_entry_t fields to BASE/WORKING |
| ------------------------------------------------------- |
| |
| [Note: This section is mainly to clarify the difference between the BASE |
| and WORKING trees, it's not here to mean that we actually need all these |
| fields in wc-ng!] |
| |
| Here are the mappings of all fields from svn_wc_entry_t to the BASE and |
| WORKING trees: |
| |
| +-------------------------------+------+---------+ |
| | svn_wc_entry_t | BASE | WORKING | |
| +-------------------------------+------+---------+ |
| | name | x | x (1)| |
| | revision | x | x (2)| |
| | url | x | x (2)| |
| | repos | x | x (3)| |
| | uuid | x | x (3)| |
| | kind | x | x | |
| | absent | x | | |
| | copyfrom_url | | x | |
| | copyfrom_rev | | x | |
| | conflict_old | | x | |
| | conflict_new | | x | |
| | conflict_wrk | | x | |
| | prejfile | | x | |
| | text_time | | = | |
| | prop_time | | = | |
| | checksum | x | x (2)| |
| | cmt_rev | x | x (2)| |
| | cmt_date | x | x (2)| |
| | cmt_author | x | x (2)| |
| | lock_token | x(6)| | |
| | lock_owner | x | | |
| | lock_comment | x | | |
| | lock_creation_date | x | | |
| | has_props | x | x (4)| |
| | has_prop_mods | | = | |
| | cachable_props | x(5)| x (4)| |
| | present_props | x | x (4)| |
| | changelist | | x | |
| | working_size | | = | |
| | keep_local | | = | |
| | depth | x | x | |
| | schedule | | | |
| | copied | | | |
| | deleted | | | |
| | incomplete | | | |
| +-------------------------------+------+---------+ |
| |
| (1) if this one differs from BASE, it must point to the source of a rename |
| (2) for an add-with-history |
| (3) or can we assume single-repository working copies? |
| (4) can differ from BASE for add-with-history |
| (5) why is this a field at all; can't the WC code know? |
| (6) locks apply to in-repository paths, hence BASE |
| |
| The fields marked with '=' are implementation details of internal detection |
| mechanisms, which means they don't belong in the public interface. |
| |
| Fields with no check are to become obsolete. 'schedule', 'copied' and |
| 'deleted' can be deduced from the difference between the BASE and WORKING |
| or WORKING and ACTUAL trees. 'incomplete' should become obsolete when the |
| goal of 'atomic updates' can be realised, in which case the tree can't be |
| in an incomplete yet locked state. This would also invalidate issue #1879. |
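| |
| As an illustration of how 'schedule' might be deduced rather than |
| stored, the sketch below derives it from which nodes exist. The |
| "present"/"deleted" values for the WORKING node and the exact mapping |
| are assumptions for this example, not definitions from this document. |

```python
from typing import Optional


def derive_schedule(has_base: bool, working: Optional[str]) -> str:
    """Derive a wc-1.0 style schedule from the BASE/WORKING difference.

    working is None when there is no WORKING node, otherwise the
    hypothetical values "present" or "deleted".
    """
    if working is None:
        if not has_base:
            raise ValueError("item is not versioned")
        return "normal"        # nothing differs from BASE
    if working == "deleted":
        return "delete"        # structural change removes the node
    if working == "present":
        # A WORKING node on top of BASE replaces it; without BASE it
        # is a plain addition (add, copied-here or moved-here).
        return "replace" if has_base else "add"
    raise ValueError(f"unknown WORKING state: {working!r}")
```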
| |
| |
| Basic Storage Mechanics |
| ----------------------- |
| |
| All metadata will be stored into a single SQLite database. This |
| includes all of the "entry" fields *and* all of the properties |
| attached to the files/directories. SQLite transactions will be used |
| rather than the "loggy" mechanics of wc-1.0. |
| |
| ###GJS: note that atomicity across the sqlite database and the content |
| of the ACTUAL tree is freakin' difficult. idea to test: metadata |
| says "not sure of ACTUAL", and when ops complete successfully, then |
| we clear the flag. during any future operation, if the flag is |
| present, then we approach the ACTUAL with extreme prejudice. also |
| note that we can batch clearing of the flags as an optimistic |
| efficiency approach (since if we batch 100 and the last fails, then |
| the other 99 will be slower until the wc-ng determines the ACTUAL |
| is in fine shape and clears the flag for future operations). |
| |
| ###GJS: be wary of sqlite commit performance (based on some of my |
| prior experience with it). must have timing/debugging around the |
| commit operations. may need to use various transaction isolations |
| and/or batching of commits to get proper performance. thus, profile |
| output capability is mandatory to determine if we have issues, and |
| where they occur. |
| |
| ###JSS: I don't see how transactions by themselves can replace loggy. |
| Right now, if you abort something like 'svn update' or 'svn checkout', |
| loggy has recorded all the files to be downloaded, and will pick up |
| where it left off. We did this as an optimization to prevent |
| re-downloading a potentially large amount of data again. Seems like |
| we still need to provide that capability. |
| |
| ###GJS: sqlite transactions replace the atomicity that loggy was |
| originally designed for. it sounds like loggy is also being |
| used as a work queue, and that is easily handled in sqlite. |
| |
| Base text data will be stored in a sharded directory structure, |
| keyed/named by the checksum (MD5 or SHA1) of the file. The database |
| will record appropriate mappings, content/compression types, and |
| refcounts for the base text files (for the shared case). We will use a |
| single level of directories: |
| |
| TEXT_BASE/7c/7ca344... |
| |
| With 100k files spread across all of a user's working copies, that |
| will put 390 files into each subdirectory, which is quite fine. If the |
| user grows to a million files, then 3900 per subdir is still |
| reasonable. Two levels would effectively mean one file per subdir in |
| typical situations, which is a lot of disk overhead. |
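| |
| The layout and the arithmetic above can be checked with a short Python |
| sketch. The function names are hypothetical, SHA-1 is picked arbitrarily |
| from the two checksums mentioned, and an even spread of checksums over |
| the subdirectories is assumed. |

```python
import hashlib
import posixpath


def pristine_path(text_base_root, content):
    """Path of a base text: one directory level named by the first two
    hex digits of the checksum, then the full checksum as the file name."""
    digest = hashlib.sha1(content).hexdigest()
    return posixpath.join(text_base_root, digest[:2], digest)


def average_files_per_subdir(total_files, prefix_hex_digits=2):
    """Average file count per subdirectory for an even spread.
    Two hex digits give 16**2 = 256 subdirectories."""
    return total_files // (16 ** prefix_hex_digits)


one_level_100k = average_files_per_subdir(100_000)     # 100000 // 256
one_level_1m = average_files_per_subdir(1_000_000)     # 1000000 // 256
two_level_100k = average_files_per_subdir(100_000, 4)  # 100000 // 65536
empty_file_path = pristine_path("TEXT_BASE", b"")
```

| With two levels (four hex digits, 65536 subdirectories) the average for |
| 100k files drops to about one file per subdirectory, which is the disk |
| overhead concern raised above. |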
| |
| When the metadata is recorded in a central area (rather than the WC |
| root), then it is possible for the metadata and the base files to |
| become out of date with respect to all the working copies on the |
| system. We will revamp "svn cleanup" to re-tally the base text |
| reference counts, eliminate unreferenced bases, verify that the |
| working copies are still present, ensure the metadata <-> WC |
| integrity, deal with moves of metadata from central -> wc-root (can |
| happen if somebody rm -rf's the wc, then does a checkout and wants the |
| metadata at the wc-root (this time)), and other consistency checks. |
| |
| |
| Metadata Schemas |
| ---------------- |
| |
| see libsvn_wc/wc-metadata.sql3 |
| |
| The table below describes, in English, the various combinations of |
| "presence" values as they occur in the BASE_NODE and WORKING_NODE |
| tables. |
| |
| BASE_NODE WORKING_NODE DESCRIPTION |
| |
| normal <none> Node has been checked out from the |
| repository normally. |
| absent <none> Server has marked the node as "absent", |
| meaning the user does not have |
| authorization to view the content. |
| excluded <none> The node has been (locally) marked as |
| excluded from the working copy. |
| not-present <none> The node is not present at its current |
| revision. The parent directory has a |
| different revision which states the node |
| *is* present. This state is usually |
| reached by locally deleting a file and |
| committing it. Later, when an update is |
| run, the directory will be bumped to a |
| revision that does not contain the file, |
| and this not-present node will be cleared. |
| incomplete <none> The node is known, but the node's |
| information has not been downloaded |
| (yet) from the server. This typically |
| occurs from an interrupted checkout. The |
| parent directory was added, specifying |
| all the children, but the checkout was |
| stopped before fetching the child. |
| base-deleted * Not allowed. This presence is only valid |
| for the WORKING_NODE table. |
| absent <any> Not allowed. The name exists on the |
| server and cannot be modified in any way. |
| incomplete <any> An update "under" some local changes was |
| terminated before fetching information |
| on this node. |
| <none> normal This node has been locally-added through |
| a simple add, a copy, or a move (other |
| data needs to be examined to determine |
| what operation brought the node here). |
| normal normal The underlying BASE node has been |
| deleted, and a new node has been added |
| in its place (this is a "Replace"). |
| excluded normal A node was excluded from the checkout or |
| update, and we have locally-added a new |
| node to *replace* it. |
| not-present normal A node is no longer present in the BASE |
| tree due to mixed-revision working copy |
| concerns. This is an addition (not a |
| replace) of a new node, in the same |
| location as a node that was deleted (and |
| committed) at some point in history. |
| * absent Not allowed. This would imply that the |
| server has prevented our access, but |
| this is a local, uncommitted change. The |
| server cannot block the node. |
| (see Note 1) |
| * excluded An add-with-history or a move has been |
| performed, and this node has been |
| excluded from the working copy. Note: |
| plain adds cannot have an excluded |
| node -- we'd just not add the node. |
| Further note: the root of the copy/move |
| cannot be excluded since we need the |
| source information. The root may be |
| depth==empty, however. |
| * not-present This node has been locally-deleted. This |
| can only occur for a child of a copied |
| or moved subtree (for a plain add, we |
| simply revert the add; and must be a |
| child, or we'd just revert the whole |
| copy or move operation). |
| * incomplete This node is known, but the information |
| is missing. A copy, move, or deletion |
| has been interrupted, leaving a |
| directory with known children, but |
| lacking their state. |
| (see Note 2) |
| normal base-deleted The BASE node has been locally-deleted. |
| excluded base-deleted The node was excluded from the working |
| copy, but has been locally-deleted. |
| (see Note 3) |
| not-present base-deleted Not allowed. There is nothing to |
| delete. The not-present is a tool to |
| represent mixed-rev working copies; |
| there is (logically) no node to delete. |
| |
| |
| "<none>" means there is no row in the given table (so no presence value) |
| "<any>" means "any value" |
| "*" means "<none> or <any>" |
| |
| Note 1: this implies you cannot copy/move a working copy tree that has |
| absent nodes in it. If that were made possible, then we may |
| (instead) want to model this as a copy/move followed by a |
| local-delete of the absent node(s). |
| Note 2: this will probably only apply to a repository-to-wc copy. For |
| wc-to-wc copies/moves, we will probably transact the entire operation |
| so that a child will never be incomplete. |
| Note 3: this may be possible, though we may need more state to pull |
| it off (e.g. which revision of the "not there" node will be |
| deleted? we need the rev for out-of-date checks) |
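| |
| The "Not allowed" rows of the table can be expressed as a small check. |
| This is only a sketch in Python: the function name is hypothetical, |
| None stands for "<none>" (no row), and combinations the table does not |
| list are deliberately not judged. |

```python
def forbidden_reason(base, working):
    """Return why a (BASE_NODE, WORKING_NODE) presence pair is not
    allowed according to the table, or None if no rule forbids it.
    A presence of None means there is no row in that table."""
    if base == "base-deleted":
        return "base-deleted is only valid in WORKING_NODE"
    if working == "absent":
        return "a local, uncommitted change cannot be blocked by the server"
    if base == "absent" and working is not None:
        return "an absent node cannot be modified in any way"
    if base == "not-present" and working == "base-deleted":
        return "there is (logically) no node to delete"
    return None


checked_out = forbidden_reason("normal", None)
replace = forbidden_reason("normal", "normal")
delete_of_not_present = forbidden_reason("not-present", "base-deleted")
change_under_absent = forbidden_reason("absent", "normal")
```

| A check like this could run as an assertion whenever rows are written, |
| so that forbidden pairs are caught at the point where they are created. |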
| |
| |
| Commit Process |
| -------------- |
| |
| Committing is essentially a review of all the rows in the WORKING_NODE |
| and ACTUAL_NODE tables, and sending appropriate instructions to the |
| server. After the commit, those rows are removed and the data |
| "collapsed" down into the BASE_NODE. For example, if a copy has been |
| performed, then WORKING_NODE contains data about the copy, and the new |
| BASE_NODE will "become" that WORKING_NODE. If the user additionally |
| modifies some properties (stored in ACTUAL_NODE), then those will also |
| fall down into the post-commit BASE_NODE row. |
| |
| copy_tests 24 sets up a specific scenario that breaks a lot of the |
| code in libsvn_wc today (Sep 28, 2009). While the commit is being |
| processed, the database is temporarily placed into a state which does |
| not have proper integrity. |
| |
| This section is an attempt to document rules for how nodes are to be |
| treated during the commit process. These rules focus on the effects |
| upon an individual node, independent of what happens to any child |
| nodes on the theory that children can be configured as a mixed-rev |
| working copy with appropriate operations applied. Note that the |
| decision process is independent of the children, but the children |
| *will* be affected by the commit of the parent node. The commit MUST |
| operate in a top-down fashion, however, since (for example) it is |
| impossible to model a copied parent and an unmodified (committed) BASE |
| child. |
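| |
| The top-down requirement amounts to visiting every parent before any |
| of its children. A small Python sketch, assuming '/'-separated relative |
| paths and a hypothetical helper name: |

```python
def top_down(paths):
    """Order paths so that each parent precedes all of its descendants.

    Sorting by the list of path components yields a pre-order walk:
    ["A"] sorts before ["A", "B"], which sorts before ["A", "B", "f"].
    """
    return sorted(paths, key=lambda p: p.split("/"))


commit_order = top_down(["A/B/f", "A", "A-x", "A/B"])
```

| Sorting on components rather than on the raw string keeps each subtree |
| contiguous even when a sibling name such as "A-x" would otherwise sort |
| between "A" and "A/B". |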
| |
| There is a large operational difference between directories and |
| files/symlinks, so we'll divide the discussion along that line. Note |
| that we only consider the kind of the WORKING node; the kind of the |
| BASE will simply be replaced by the new node. If the BASE used to be a |
| directory, then its (obsolete) children must be removed from BASE_NODE |
| during this commit process. |
| |
| NOTE: if either the BASE_NODE or the WORKING_NODE has an "incomplete" |
| presence, then it CAN NOT be committed. It means we are missing |
| information that may be required to properly commit a change to that |
| node. |
| ### hmm. during the commit, we create incomplete BASE nodes (see |
| ### below). so this is more of a statement of the starting condition. |
| |
| |
| FILES AND SYMLINKS |
| |
| working-presence: normal |
| base-presence: * |
| The WORKING and ACTUAL data is collapsed down into BASE_NODE, with |
| the new revision. |
| |
| |
| working-presence: excluded |
| base-presence: * |
| A new, excluded BASE_NODE is constructed, and the WORKING_NODE is |
| removed. Any BASE_NODE rows which appear to be descendants of this |
| (used-to-be-directory) node are removed. There should be no |
| descendants in the WORKING_NODE table. |
| |
| ### what information do we keep for excluded nodes? |
| ### note: at this point, there is no user command to exclude |
| ### files/symlinks. but we will be able to at some point... |
| |
| |
| working-presence: not-present |
| base-presence: * |
| NOTE: this situation should never be seen, since the node's parent |
| should have been committed first, which handles this node as part of |
| the child processing (see <normal, *> resolution for directories). |
| |
| ### in short, this working-presence can only occur for a deleted |
| ### *subroot*. and we cannot commit *just* this node. must commit |
| ### the root of any copy/move operation. |
| |
| |
| working-presence: base-deleted |
| base-presence: * |
| The WORKING_NODE and BASE_NODE rows are removed. |
| |
| ### if base-presence == excluded, there are some concerns: |
| ### -- what information do we keep for base=excluded nodes? |
| ### -- the base-deleted node would (at least) need to retain the |
| ### revision in order to mark it for deletion. |
| ### -- note: at this point, there is no user command to exclude |
| ### files/symlinks. but we will be able to at some point... |
| |
| |
| actual: row is available |
| working-presence: <none> |
| base-presence: normal |
| ACTUAL_NODE.properties are folded into the BASE_NODE and the |
| revision is bumped. |
| |
| Note: there is no data in ACTUAL_NODE other than properties. |
| Conflict information exists, but that must be cleared before |
| committing is possible. |
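| |
| The file/symlink rules above can be read as one decision function. The |
| Python sketch below uses plain dicts for rows; the row shape, the |
| function name, and the choice to treat ACTUAL properties as the full |
| property set are all assumptions. What an excluded BASE row should keep |
| is an open question above, so the sketch keeps only a revision. |

```python
def commit_file(base, working, actual_props, new_rev):
    """Return the post-commit BASE row, or None if the node is removed.

    base / working: dicts with a "presence" key (and "props" where
    relevant), or None for "no row". actual_props: dict or None.
    """
    for row in (base, working):
        if row is not None and row["presence"] == "incomplete":
            raise ValueError("incomplete nodes cannot be committed")
    if working is None:
        # <none> WORKING: only ACTUAL property changes on a normal BASE.
        if (base is not None and base["presence"] == "normal"
                and actual_props is not None):
            return {"presence": "normal", "revision": new_rev,
                    "props": actual_props}
        return base  # nothing to commit for this node
    presence = working["presence"]
    if presence == "normal":
        # WORKING and ACTUAL collapse into BASE at the new revision.
        props = working["props"] if actual_props is None else actual_props
        return {"presence": "normal", "revision": new_rev, "props": props}
    if presence == "excluded":
        return {"presence": "excluded", "revision": new_rev, "props": None}
    if presence == "base-deleted":
        return None  # both the WORKING and BASE rows go away
    # not-present (and anything else) is handled via the parent's commit.
    raise ValueError("%s cannot be committed on its own" % presence)


old_base = {"presence": "normal", "revision": 3, "props": {}}
added = commit_file(None, {"presence": "normal", "props": {"k": "v"}}, None, 7)
deleted = commit_file(old_base, {"presence": "base-deleted"}, None, 7)
prop_change = commit_file(old_base, None, {"svn:eol-style": "native"}, 7)
```

| The ValueError for not-present mirrors the rule that such a node is |
| only ever committed as part of its copy/move root. |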
| |
| |
| DIRECTORIES |
| |
| Whenever a directory is bumped to a new revision, the new set of |
| children is provided. This is required, in order to maintain the |
| proper integrity. |
| |
| Example: if a child is to be added in this new revision, but a failure |
| happened between the directory-commit and the add-child processes, |
| then there would be no record of the added child. The directory would |
| not know it was missing a child and would report "at revision R" to |
| the server, implying that the child is present. |
| |
| working-presence: normal |
| base-presence: * |
| The WORKING and ACTUAL data is collapsed down into BASE_NODE, with |
| the new revision. |
| |
| The depth status of the WORKING node is carried over to the BASE |
| node. Children in the WORKING_NODE table should align with that |
| depth value, and the commit will iterate over each available child |
| row. Its status will be examined, and specific action taken: |
| |
| normal |
| An incomplete node should be added for this child. This node |
| will become its own add/copy/move root, and will be handled as a |
| separate action (via recursion over the children). |
| |
| excluded |
| An incomplete node should be added for this child. No action |
| taken. This node will become an excluded BASE node when it is |
| handled as a separate action (via recursion). |
| |
| not-present |
| This row in WORKING_NODE is removed, along with descendant |
| nodes. The directory will not list this node in its (new) set of |
| children. Any BASE_NODE row at this path is also removed, along |
| with any descendant nodes. |
| |
| base-deleted |
| No action taken. This node will be removed when it is handled |
| as a separate action (via recursion). |
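| |
| The per-child actions above, sketched in Python. The function name and |
| the use of simple {name: presence} maps are assumptions; descendant |
| removal and the later recursion into each child are left out. |

```python
def process_children(working_children, base_children):
    """Compute the BASE child map of a directory that was just committed
    with working-presence 'normal'.

    Both arguments map child name -> presence. The result holds the
    children as they stand right after the parent is committed and
    before each child is handled by its own (recursive) commit step.
    """
    new_base = {}
    for name, presence in sorted(working_children.items()):
        if presence in ("normal", "excluded"):
            # Placeholder row; the child's own commit step finalizes it.
            new_base[name] = "incomplete"
        elif presence == "not-present":
            # Row removed from WORKING and BASE; not listed as a child.
            continue
        elif presence == "base-deleted":
            # No action yet: the BASE row survives until the recursion.
            if name in base_children:
                new_base[name] = base_children[name]
        else:
            raise ValueError("unexpected child presence: %r" % presence)
    return new_base


result = process_children(
    {"a": "normal", "b": "excluded", "c": "not-present", "d": "base-deleted"},
    {"c": "normal", "d": "normal", "e": "normal"},
)
```

| In this sketch, BASE children with no WORKING row (such as "e") are |
| dropped, following the rule that obsolete children of a replaced |
| directory are removed from BASE_NODE during the commit. |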
| |
| |
| working-presence: excluded |
| base-presence: * |
| A new, excluded BASE_NODE is constructed, and the WORKING_NODE is |
| removed. Any BASE_NODE rows which appear to be descendants of this |
| (used-to-be-directory) node are removed. There should be no |
| descendants in the WORKING_NODE table. |
| |
| ### what information do we keep for excluded nodes? |
| |
| |
| working-presence: not-present |
| base-presence: * |
| NOTE: this situation should never be seen, since the node's parent |
| should have been committed first, which handles this node as part of |
| the child processing (see <normal, *> resolution). |
| |
| |
| working-presence: base-deleted |
| base-presence: * |
| The WORKING_NODE and BASE_NODE rows are removed. |
| |
| |
| actual: row is available |
| working-presence: <none> |
| base-presence: normal |
| ACTUAL_NODE.properties are folded into the BASE_NODE and the |
| revision is bumped. |
| |
| Note: there is no data in ACTUAL_NODE other than properties. |
| Conflict information exists, but that must be cleared before |
| committing is possible. |
| |
| |
| Random Notes |
| ------------ |
| |
| ### break down all modification operations to things that operate on a |
| small/fixed set of rows. if a large sequence of operations fails, |
| then it can leave the system in a reparable state, since most were |
| performed. note that ACTUAL can change at any time, thus all mods |
| should be able to compensate for ACTUAL being something |
| unexpected. thus, the transformative operations should be able to |
| fail in such a way as to leave ACTUAL pretty bunged up. |
| |
| ### probably want to special-case the checksum and BASETEXT entry for |
| the "empty file" |
| |
| |
| Code Organization |
| ----------------- |
| |
| libsvn_wc/wc_db.h (symbols: svn_wc__db_*) |
| Storage subsystem for the WC metadata/base-text information. |
| This is a private API, and the rest of the WC will be rebuilt |
| on top of this. |
| |
| This code deals with storage, and transactional modifications |
| of the data. |
| |
| Note: this is a random-access, low-level API. Editors will be |
| built on top of this layer. |
| |
| libsvn_wc/workqueue.h (symbols: svn_wc__wq_*) |
| The "work queue" is a subsystem to replace the old "loggy" |
| subsystem. It will perform (primarily) filesystem operations |
| in a transactional way. |
| |
| |
| svn_wc.h API |
| ------------ |
| |
| Note that we also have an opportunity to revamp the WC API. Things |
| like access batons will definitely disappear, but there will most |
| likely be great opportunities for other design changes. |
| |
| Note that removing access batons (and other API changes) will ripple |
| up to libsvn_client, and may even have an effect on *its* API. |
| |
| ### the form of a new API is unknown/TBD. |
| |
| We are going to add svn_wc_context_t to be created once per process, |
| and passed to all svn_wc functions. This will replace the (often |
| confusing) use of access batons. |
| |
| Implementation note: this context will hold an svn_wc__db_t handle, |
| and a pointer to the process's svn_config_t object. |
| |
| |
| Upgrading old working copies |
| ============================ |
| When WC-NG finds a working copy which is pre-wc-ng, it will quit, prompting the |
| user to run 'svn upgrade' to upgrade the working copy to a wc-ng state. The |
| reason for not upgrading on-the-fly is two-fold: |
| * We anticipate this process to be irreversible, so we want to ensure the |
| user wants to upgrade (no silent upgrading/breakage). |
| * The upgrade may be I/O and computationally intensive, and keeping with the |
| principle of least surprise, we want to ensure that the upgrade is done |
| intentionally, when the user expects it. |
| |
| Here's how we plan on implementing 'svn upgrade', so that it maintains |
| consistency across the working copy atomically. We have two requirements: |
| do the upgrade completely, and *don't* leave the working copy in an unusable |
| state if the upgrade fails. The steps for upgrading are: |
| |
| 1) Create the new wc.db as invisible.db in the working copy root |
| 2) Upgrade the current directory into invisible.db |
| 3) Drop a flag file into the directory to signal an "in process" upgrade |
| 4) Recurse on step 2 for each of the subdirectories |
| 5) Move invisible.db to wc.db in the working copy root |
| 6) Recursively remove each of the .svn subdirs for each wc subdir |
| |
| Note that nowhere do we attempt to run or upgrade old logs. This is |
| intentional. In order to simplify the development and maintenance burden, |
| we intend to bail when the upgrade process encounters a working copy with |
| un-run logs. In this state, it will be up to the user to run 'svn cleanup' |
| with the prior version of Subversion to ensure the working copy is in an |
| upgradable state. Failing that, the user can always do a fresh checkout. |
| |
| The atomic step is step 5. Should the upgrade process be interrupted prior |
| to Step 5, the working copy will still be usable by a pre-wc-ng client, but |
| will just have extra stuff in the .svn directories, namely invisible.db in |
| the root, and the various flag files everywhere. Should the upgrade get |
| interrupted *after* Step 5 (but before all the .svn directories are removed), |
| the .svn directories will show up as unversioned directories. Not ideal, |
| but not terribly bad, either. |
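| |
| A sketch of why step 5 is the atomic step, in Python. It covers steps |
| 1, 2/4 and 5 only; the flag files (step 3) and the removal of old .svn |
| subdirectories (step 6) are omitted. The function name, the one-table |
| schema, the old_entries argument standing in for the old metadata, and |
| the placement of invisible.db under the root's .svn directory are all |
| assumptions. |

```python
import os
import sqlite3
import tempfile


def upgrade(wc_root, old_entries):
    """old_entries maps a relative directory to its entry names and
    stands in for whatever the old per-directory metadata contained."""
    admin = os.path.join(wc_root, ".svn")
    os.makedirs(admin, exist_ok=True)
    tmp_db = os.path.join(admin, "invisible.db")
    final_db = os.path.join(admin, "wc.db")
    if os.path.exists(tmp_db):
        os.remove(tmp_db)  # a restarted upgrade begins from scratch

    # Steps 1, 2 and 4: build the complete database under a name that
    # no client looks for.
    conn = sqlite3.connect(tmp_db)
    conn.execute("CREATE TABLE node (dir TEXT, name TEXT)")
    with conn:
        for rel_dir, names in old_entries.items():
            conn.executemany("INSERT INTO node VALUES (?, ?)",
                             [(rel_dir, name) for name in names])
    conn.close()

    # Step 5: a single rename publishes the finished database. On POSIX
    # this is atomic when both names are on the same filesystem.
    os.replace(tmp_db, final_db)
    return final_db


with tempfile.TemporaryDirectory() as root:
    db_path = upgrade(root, {"": ["a.txt"], "sub": ["b.txt"]})
    published = os.path.exists(db_path)
    leftover = os.path.exists(os.path.join(root, ".svn", "invisible.db"))
    check = sqlite3.connect(db_path)
    row_count = check.execute("SELECT COUNT(*) FROM node").fetchone()[0]
    check.close()
```

| An interruption before os.replace leaves only invisible.db behind, |
| which matches the "extra stuff" case described above. |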
| |
| ### perhaps we should have a way of cleaning up all the .svn dirs from an |
| interrupted upgrade? |
| |
| ### what if somebody attempts to use an old Subversion on a working copy |
| with a .svn which hasn't been harvested yet? it should succeed, but |
| may leave a discrepancy between the wc-ng database and the .svn |
| metadata, since old Subversions don't know to recurse up the tree. |
| could we do something to the working copy to render it unusable by older |
| subversions, but without too much pain? how about simply bumping the |
| format number in entries.c? a properly chosen format number would make |
| older Subversions complain, but also make Subversion 1.7 prompt for an |
| upgrade. hmm..... |
| |
| When an upgrade is restarted, invisible.db will just be blown away and |
| recreated from scratch, since already-upgraded directories could have been |
| modified between invocations of 'svn upgrade'. |
| |
| |
| Implementation Plan |
| =================== |
| The following are tasks which need to be accomplished for WC-NG. There |
| isn't a strict ordering here, but rather a possible plan. There may be |
| dependencies between some items, but that is left as an exercise for the |
| reader. |
| |
| * Pristine file management |
| * Properties management |
| * Tree management (BASE v. WORKING v. ACTUAL for APIs and storage) |
| * Journaled actions |
| * Finding/using the correct admin area |
| * Upgrading |
| - Including multiple heterogeneous admin areas |
| * Move entries into SQLite |
| * Relocating datastore in useful ways |
| |
| Afterwards, we'll need: |
| * A second pass at the WC code to find/fix patterns and solutions. |
| * Revamp of WC API, to propagate up into libsvn_client. |
| * Reexamine any client/wc interactions, and look for final cleanups. |
| |
| Near-Term Plan |
| -------------- |
| |
| Note: we originally envisioned the "ordering" below. In practice, |
| however, we have been attacking the overall problem from multiple |
| angles. Typically, we are finding conceptual/API bottlenecks that |
| make it hard to accomplish a number of other tasks. We solve the |
| bottleneck, and move on to solving the higher-level problems. It |
| is a continual "evolutionary" process. Some temporary APIs are being |
| introduced to help bridge the conceptual gap between wc-1 and wc-ng. |
| These should disappear by release time, but serve to mitigate code |
| disruption and potential for error. |
| |
| 1. convert entries.c to use sqlite directly. migrate 'entries' file |
| during this step. the sqlite file will be in-memory if we are not |
| allowed to auto-upgrade the WC; otherwise, we'll write the sqlite |
| database into .svn/ |
| note: the presence of 'wc.db' (or whatever its name) will indicate |
| a minimum format level. the user_version field in the database |
| contains the schema version which is our further format-level |
| descriptor value. |
| [ this has been largely done. ] |
| |
| 2. convert entries.c to use wc_db. shift the sqlite code into wc_db. |
| note: this is a separate step from 1. there is a paradigm shift |
| between how entries.c works and wc_db works. we want to |
| ignore that in Step 1, and then handle it in this Step. |
| note: put wc_db handle into lock->shared and share the handle |
| across all directories/batons. |
| [ because of the broken way the upper layers use the entries API, using |
| the wc_db APIs to write entries proves difficult, since we violate |
| all kinds of constraints. ] |
| |
| 3. convert props.c to use wc_db. migrate props to db simultaneously. |
| [ this is currently in progress. ] |
| |
| 4. implement 'svn cleanup' as an upgrade path from old-style working |
| copies to wc-ng |
| [ done, but work will continue as the wc-ng format continues to evolve. ] |
| |
| 5. incremental shift of pristines from N files into pristine db. |
| note: we could continue to leave .revert-base while we migrate the |
| primary base into the pristine dataset. |
| |
| 6. shift libsvn_wc from using entries.h to using wc_db.h. |
| note: since entries.h is "merely" a wrapper for wc_db.h, this will |
| allow the libsvn_wc to start using the new wc_db APIs |
| wherever it is easy/possible. |
| goal: all libsvn_wc code uses wc_db.h, and entries.h exists solely |
| to support old backwards-compat code. |
| [ this isn't quite a discrete task, but is happening gradually as we work |
| through the libsvn_wc API. ] |
| |
| 7. centralize the metadata and pristines |
| note: this will also involve merging datastores |
| |
| 8. replace loggy with sqlite-based work journal |
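| |
| The note under item 1 says the database itself carries the schema |
| version. A minimal Python sketch, assuming this means SQLite's built-in |
| user_version value; the version number 12 and the helper names are made |
| up for illustration. |

```python
import sqlite3

SCHEMA_VERSION = 12  # hypothetical format number


def write_schema_version(conn, version):
    # PRAGMA statements do not accept bound parameters, so the integer
    # is formatted into the statement after being forced through int().
    conn.execute("PRAGMA user_version = %d" % int(version))


def read_schema_version(conn):
    return conn.execute("PRAGMA user_version").fetchone()[0]


conn = sqlite3.connect(":memory:")
initial = read_schema_version(conn)  # a fresh database reports 0
write_schema_version(conn, SCHEMA_VERSION)
current = read_schema_version(conn)
needs_upgrade = initial < SCHEMA_VERSION
```

| Reading the value on open gives a cheap way to decide whether the |
| working copy must be upgraded before any other table is touched. |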
| |
| |
| Endgame |
| ------- |
| |
| As WC-NG development has progressed, and many of the above milestones have been |
| met, we've identified the following milestones leading to the completion of |
| wc-ng development (and hence 1.7). They are not all necessarily serially |
| dependent, but some dependencies do exist. |
| |
| 1. Move properties into wc.db. |
| 2. Convert loggy actions into work queue actions. |
| 3. Move pristines into a SHA-1 based store. |
| 4. Consolidate metadata into the centralized system. |
| 5. Test, tweak and release! |
| |
| |
| The above items are milestones. There are a number of work items that |
| need to be completed in/around the milestones. The progress of this |
| work can be roughly measured by the tools/dev/wc-ng/count-progress.py |
| script: |
| |
| * remove use of svn_wc_adm_access_t and svn_wc_entry_t |
| * review/revamp the requirements, definitions, and use of the |
| svn_wc__node_* and svn_wc__db_temp_* functions |