|  | Anyone who has worked on the libsvn_wc code will have discovered that | 
|  | the current code is a complicated mess of special cases, and that it | 
|  | is difficult to understand, inconsistent, slow and buggy.  I know this | 
|  | because I wrote some of it.  It's possible that the libsvn_wc code | 
|  | will gradually evolve into an elegant, efficient, code base; on the | 
|  | other hand comments like "when we rewrite libsvn_wc" regularly appear | 
|  | on the dev list.  This document is *not* a plan or design for a | 
|  | rewrite, it's just some of the thoughts of a libsvn_wc hacker. | 
|  |  | 
|  | From Past to Present | 
|  | ==================== | 
|  |  | 
|  | The original code for libsvn_wc used an implementation that stored | 
|  | more or less all state information on disk in the .svn area on disk, | 
|  | so during most operations the entries files were read and written many | 
|  | times.  This led to the development of an API that passed around a lot | 
|  | of path parameters (as svn_string_t* originally, as const char* now) | 
|  | and to the development of the svn_io_xxx functions, which also accept | 
|  | path parameters.  The implementation was slow, and didn't scale | 
|  | particularly well as working copies got larger.  To improve things the | 
|  | current access baton and entries caching system was gradually hacked | 
|  | in, and libsvn_wc is now faster and scales a bit better, but problems | 
|  | still remain. | 
|  |  | 
|  | A good example of the problems caused by the "path as parameter" API | 
|  | is svn_wc_text_modified_p.  Its basic function is to determine if the | 
|  | base text file and the working file are the same or different, but | 
|  | physical IO operations have to be repeated because they are buried | 
|  | behind several layers of API.  It's difficult to fix without | 
|  | rewriting, or duplicating, a number of svn_io_xxx and svn_wc_xxx | 
|  | functions.  Aside from the repeated IO itself, each IO operation also | 
|  | has to repeat the UTF-8 to native path conversion. | 
|  |  | 
|  | The current entries caching makes things faster than in the past, but | 
|  | has its own problems.  Most operations now cache the entire entries | 
|  | hierarchy in memory which limits the size of the working copies that | 
|  | can be handled.  The problem is difficult to solve as some operations | 
|  | make multiple passes--commit for instance makes a first pass searching | 
|  | for modifications, a second pass reporting to the repository, and a | 
|  | third pass to do post-commit processing. | 
|  |  | 
|  | The original code also did not always make a distinction between the | 
|  | versioned hierarchy in the entries file and the physical hierarchy on | 
|  | disk.  Things like using stat() or svn_io_check_path() calls to | 
|  | determine whether an item was versioned as file or directory do not | 
|  | work when the working copy on disk is obstructed or incomplete. | 
|  |  | 
|  | The Future | 
|  | ========== | 
|  |  | 
|  | Some of these ideas are trivial, some of them are difficult to | 
|  | implement, some of them may not work at all. | 
|  |  | 
|  | - Have an svn_wc_t context object, opaque outside the library, that | 
|  | would replace the access batons.  This would get passed through most | 
|  | of the libsvn_wc functions and could read/cache the entries files on | 
|  | demand as the working copy was traversed.  It could also cache the | 
|  | UTF-8 xlate handle. | 
|  |  | 
|  | - Have an API to svn_wc_entry_t, perhaps make the struct opaque, so | 
|  | that things like URL need not be constructed when the entries file | 
|  | is read but can be created on demand if required and possibly cached | 
|  | once created.  The aim would be to reduce the memory used by the | 
|  | entries cache. | 
|  |  | 
|  | - Consider caching physical IO results in svn_wc_entry_t/svn_wc_t. | 
|  | Should we really stat() any file more than once?  This becomes less | 
|  | important as we reduce the number of IO operations. | 
|  |  | 
|  | - Consider caching UTF-8 to native path conversions either in | 
|  | svn_wc_t, or svn_wc_entry_t, or locally in functions and using | 
|  | svn_io_xxx equivalents that accept native paths.  This becomes less | 
|  | important as we reduce the number of IO operations. | 
|  |  | 
|  | - Make interfaces pass svn_wc_entry_t* rather than simple paths.  The | 
|  | public API using const char* paths would remain to be used by | 
|  | libsvn_client et al. | 
|  |  | 
|  | - Maintain a clear distinction between the versioned hierarchy and the | 
|  | physical hierarchy when writing code, it's usually a mistake to use | 
|  | one when the other should be used.  To this end, audit the use of | 
|  | svn_io_check_path(). | 
|  |  | 
|  | - Avoid using stat() to determine if an item is present on disk before | 
|  | using the item, just use it straight away and handle the error if it | 
|  | doesn't exist. | 
|  |  | 
|  | - Search out and destroy functions that read and discard entries files | 
|  | e.g. the apparently "simple" functions like svn_wc_is_wc_root or | 
|  | check_wc_root.  Such overhead is expensive when used by operations | 
|  | that are not going to do much other work, running status on a single | 
|  | file for example.  The overhead may not matter to a command line | 
|  | client, but it can matter to a GUI that makes many such calls. | 
|  |  | 
|  | - Consider supporting out of tree .svn directories. | 
|  |  | 
|  | - In the present code most operations are IO bound and have CPU to | 
|  | spare.  Perhaps compressed text-bases would make things faster | 
|  | rather than slower, by trading spare CPU for reduced IO? | 
|  |  | 
|  | - Keep track of the last text time written into an entries file and | 
|  | store it in svn_wc_t.  Then when we come to do a timestamp sleep | 
|  | we can do it from that time rather than the current time. | 
|  |  | 
|  | - Store working file size in the entries file and use it as another | 
|  | shortcut to detect modifications.  This should not need any extra | 
|  | system calls, the stat() for timestamp can also return the size. | 
|  | When it triggers it will be much faster than possibly detranslating | 
|  | and then doing a byte-by-byte comparison. | 
|  | ### Problem: This doesn't work when the file needs translation, because the | 
|  | ### file might be modified in such a way that these modifications disappear | 
|  | ### when the file is detranslated. | 
|  |  | 
|  | - Make the entries file smaller.  The properties committed-date | 
|  | committed-rev and last-author are really only needed for keyword | 
|  | expansion, so only store them if the appropriate svn:keywords value | 
|  | is present.  Note that committed-rev has a more general use as rPREV, | 
|  | however just about all uses of rPREV involve repository access so | 
|  | rPREV could be determined via an RA call.  Removing the three | 
|  | properties could reduce entries file size by as much as one third, | 
|  | it's possible that might make reading, writing and parsing faster. | 
|  | It would reduce the memory used to cache the entries, an ABI change | 
|  | to svn_wc_entry_t might reduce it further. | 
|  |  | 
|  | - Look at calls to svn_wc__get_keywords and svn_wc__get_eol_style | 
|  | Each of those reads the properties file.  If they occur together | 
|  | then consider replacing them with a single call to svn_wc_prop_list, | 
|  | and perhaps write some functions that accept the properties hash | 
|  | as an argument.  Alternatively, consider caching the existence of | 
|  | these two properties in the entries file to avoid reading the props | 
|  | file at all in some cases. | 
|  |  | 
|  | - Optimise update/incomplete handling to reduce the number of times | 
|  | the entry file gets written. | 
|  | http://svn.haxx.se/dev/archive-2005-03/0060.shtml | 
|  |  | 
|  | * Avoid adding incomplete="true" if the revision is not changing. | 
|  |  | 
|  | * Don't write incomplete="true" immediately, cache it in the access | 
|  | baton and only write it when next writing the entries file. | 
|  |  | 
|  | * Combine removing incomplete="true", and revision bumping, with the | 
|  | last change due to the update. | 
|  |  | 
|  | - The svn_wc_t context could work in conjunction with a more advanced | 
|  | svn_wc_crawl_revisions system.  This would provide a way of plugging | 
|  | multiple callbacks into a queue, probably with some sort of ordering | 
|  | and filtering ability, the aim being to replace most/all of the | 
|  | existing explicit loops.  This would put more of the pool handling in | 
|  | one central location, it may even be possible to provide different | 
|  | entry caching schemes.  I don't know how practical this idea is, or | 
|  | even if it is desirable. | 
|  |  | 
|  | - Have a .svn/deleted directory so that schedule delete directories | 
|  | can be moved out of the working copy. At present a skeleton hierarchy | 
|  | of schedule delete directories remains in the working copy until the | 
|  | delete is committed. | 
|  |  | 
|  | - When handling a delete received during update/switch perhaps do it | 
|  | in two stages.  First move the item into a holding area within .svn | 
|  | and finally delete all such items at the end of the update.  This | 
|  | would allow adds-with-history to use the deleted item and so might | 
|  | be a way to handle moves (implemented as delete plus add) in the | 
|  | presence of local modifications.  Thought would have to be given to | 
|  | the revision of the local deleted item, what happens if it doesn't | 
|  | match the copyfrom revision?  Perhaps we could get diffs, rather | 
|  | than full text, for adds-with-history if the copyfrom source is | 
|  | reported to the repository? | 
|  |  | 
|  | - Consider implementing atomic move for wc-to-wc moves, rather than | 
|  | using copy+delete.  This would be considerably faster for big | 
|  | directories, would lead to better revert behaviour, and avoid | 
|  | case-insensitivity problems (and if we ever get atomic mv in | 
|  | libsvn_fs then the wc code would be ready for it). | 
|  |  | 
|  | - Consider writing some libsvn_wc compiled C regression tests to allow | 
|  | more complete coverage.  Most of the current libsvn_wc testing is | 
|  | done via the command line client and it can be hard to get a working | 
|  | copy into the state necessary to test all code paths. | 
|  |  | 
|  | - There are some basic features that are fragile.  Switch has some | 
|  | bugs that can break a working copy, see issue 1906.  I don't know | 
|  | how the system is supposed to work in theory, let alone how it | 
|  | should be implemented.  Non-recursive checkout is broken, see issue | 
|  | 695; this probably applies to non-recursive update and switch as well. | 
|  |  | 
|  | - Use absolute paths within libsvn_wc so that "." is not automatically | 
|  | a wc root. | 
|  |  | 
|  | - Read notes/entries-caching for some details of the logging/caching | 
|  | in the current libsvn_wc.  It's important that writing the entries | 
|  | file is handled efficiently. |