| |
| "I have a cunning plan" |
| |
| or |
| |
| Entries Caching in the Access Batons |
| |
| |
| |
| 0. Preamble |
| -------- |
| |
| Entries caching appears to be a good idea for making the client and WC |
| libraries faster. There has been some discussion about this, but how |
| it could be implemented has never really been written down. I have a |
| mental picture of how entries caching could work, but the picture is a |
| little blurred in places, and I'm a bit worried that the dark bit in |
| the corner (probably my thumb) may be obscuring important details. |
| |
| The solution being developed as part of issue 749 is to use the |
| svn_wc_adm_access_t access batons to cache the results of |
| svn_wc_entries_read, so that the .svn/entries file does not need to be |
| read and parsed repeatedly. |
| |
| What makes this hard is that the entries file is currently accessed in |
| a large number of places in the code. If we attempt to introduce |
| caching gradually there is a danger that we will mix code that uses |
| the cache with code that accesses the entries file directly. Such |
| mixing is not a good idea, as it is possible that the cache and |
| entries file may get out of sync. Even if we could ensure that each |
| client operation used the cache consistently (could we do that?) it |
| would make future development hard, as we would need to ensure that |
| such consistency didn't break. Introducing caching everywhere in a |
| single step is better, but the code changes to do it would be |
| gigantic. |
| |
| |
| 1. Caching Interface |
| ----------------- |
| |
| The plan is to identify the places where there needs to be an access |
| baton, and then make all the changes required to pass access batons |
| around within the code, but without attempting to introduce the |
| caching code. This is being done in stages. Once the access baton is |
| in place, I hope that it will then be possible to start using caching |
| everywhere in a single step. |
| |
| The basic functions to retrieve entries are svn_wc_entries_read and |
| svn_wc_entry. The function svn_wc__entries_write is used to update |
| the entries file on disk. Simple really, only three functions, and |
| once the access baton gets this far we are more or less done! The |
| trouble is that these functions are used everywhere, so the batons |
| have to be passed through a large number of other functions. |
| |
| The basic caching read interface will consist of svn_wc_entry for a |
| single entry and svn_wc_entries_read for a hash of all entries, just |
| as it does now. Initially these functions will work exactly as they |
| do right now, except they will have gained an additional access baton |
| parameter. Once the functions support caching then switching caching |
| on should just involve very localised changes, as the entry interface |
| is the same with and without caching. |
| |
| In the longer term it may be that svn_wc_entries_read will be removed |
| in favour of providing a set of functions that access the underlying |
| cache, thus allowing the access baton to track changes made. However, |
| initially I do not think this will be required: if the current code |
| gets a hash from svn_wc_entries_read and expects it to remain valid |
| then that expectation should still apply when caching is implemented. |
| PROBLEM: I have identified one place in mark_tree where the hash is |
| retrieved and one entry is extracted and modified. I can work around |
| this case, but it shows that we really need a more robust interface. |
| Perhaps svn_wc_entries_read in its current form should be removed, and |
| replaced by some functions returning const svn_wc_entry_t pointers. |
| |
| At present access batons have a fairly strict interface: they must be |
| passed directory names, and the code always "knows" whether it is |
| supposed to have a baton for a particular directory or not (and thus |
| it knows whether to call svn_wc_adm_open or svn_wc_adm_retrieve). One |
| tricky point is that svn_wc_entry is often called first, before any |
| access batons are opened, to determine if a given path represents a |
| versioned file or a versioned directory. However svn_wc_entry falls |
| back on checking the physical working copy, so this functionality will |
| probably be copied or moved into an access baton convenience function |
| that allows opening an access baton without requiring knowledge of |
| whether the path is a file or a directory. |
| |
| The basic caching write interface is svn_wc__entries_write. Initially |
| this will write directly to the entries file, just as it currently |
| does. Later on, modifications may be cached until an explicit |
| entries_flush call is made. I haven't yet determined whether this |
| would be a significant benefit in terms of speed, or whether it would |
| risk losing changes if a process is interrupted. |
| |
| The function svn_wc__entry_modify is written in terms of entries_read |
| and entries_write and has already been converted to take an access |
| baton. |
| |
| |
| 2. Access Baton Sets |
| ----------------- |
| |
| Each access baton represents a directory. Access batons can associate |
| together in sets. Given an access baton in a set, it is possible to |
| retrieve any other access baton in the set. When an access baton in a |
| set is closed, all other access batons in the set that represent |
| subdirectories are also closed. The set is implemented as a hash |
| table "owned" by one baton in any set, but shared by all batons in |
| the set. |
| |
| At present in the code, access batons are opened in a parent->child |
| order. This works well with the shared hash being owned by the first |
| baton in each set. There is code to detect if closing a baton will |
| destroy the hash while other batons are using it; as far as I know it |
| doesn't currently trigger. If it turns out that this needs to be |
| supported it should be possible to transfer the hash information to |
| another baton. |
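
The set behaviour described above can be illustrated with a minimal sketch. It is not the libsvn_wc implementation: every name (toy_access_t, toy_set_t, toy_open, toy_retrieve, toy_close) is hypothetical, a fixed-size array stands in for the shared hash table, and the set storage is supplied by the caller rather than owned by the first baton.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_SET_MAX 8

typedef struct toy_set_t toy_set_t;

/* Hypothetical stand-in for svn_wc_adm_access_t. */
typedef struct {
    const char *path;   /* directory this baton represents */
    toy_set_t *set;     /* shared by every baton in the set */
    bool open;
} toy_access_t;

/* Hypothetical stand-in for the shared hash table. */
struct toy_set_t {
    toy_access_t batons[TOY_SET_MAX];
    size_t count;
};

/* Open a baton for PATH and add it to SET; NULL if the set is full. */
toy_access_t *toy_open(toy_set_t *set, const char *path)
{
    if (set->count == TOY_SET_MAX)
        return NULL;
    toy_access_t *baton = &set->batons[set->count++];
    baton->path = path;
    baton->set = set;
    baton->open = true;
    return baton;
}

/* Given any baton in a set, find the open baton for PATH in that set. */
toy_access_t *toy_retrieve(const toy_access_t *associated, const char *path)
{
    toy_set_t *set = associated->set;
    for (size_t i = 0; i < set->count; i++)
        if (set->batons[i].open && strcmp(set->batons[i].path, path) == 0)
            return &set->batons[i];
    return NULL;
}

/* True if PATH is PARENT itself or lies somewhere below it. */
static bool toy_is_same_or_child(const char *parent, const char *path)
{
    size_t n = strlen(parent);
    return strncmp(parent, path, n) == 0
           && (path[n] == '\0' || path[n] == '/');
}

/* Close BATON and every baton in its set that represents a subdirectory. */
void toy_close(toy_access_t *baton)
{
    toy_set_t *set = baton->set;
    const char *parent = baton->path;
    for (size_t i = 0; i < set->count; i++)
        if (set->batons[i].open
            && toy_is_same_or_child(parent, set->batons[i].path))
            set->batons[i].open = false;
}
```

Opening "wc", "wc/A" and "wc/A/B" in one set and then closing the "wc/A" baton leaves only "wc" retrievable; a sibling such as "wc/AB" is unaffected because the check requires a '/' after the parent path.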
| |
| |
| 3. Caching Mechanism |
| ----------------- |
| |
| Each access baton will cache the two possible hashes returned by |
| svn_wc_entries_read, so that subsequent calls will not need to parse |
| the entries file. If the full hash, the one containing deleted |
| entries, is available when a request for the truncated hash is made, |
| then the truncated hash will be constructed from the full hash. |
| |
| The function svn_wc__entries_write will cause the full hash cache to |
| be filled and the truncated hash cache to be cleared. |
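
A minimal sketch of the two-hash scheme described above, under stated assumptions: all names (toy_entry_t, toy_entries_t, toy_baton_t, toy_entries_read, toy_entries_write) are hypothetical, a small array stands in for each hash, a static variable stands in for the entries file, and a cache miss always fills the full hash first. The real functions have different signatures.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_ENTRIES 16

/* Hypothetical stand-in for svn_wc_entry_t. */
typedef struct {
    const char *name;
    bool deleted;
} toy_entry_t;

/* Hypothetical stand-in for one entries hash. */
typedef struct {
    toy_entry_t items[TOY_MAX_ENTRIES];
    size_t count;
    bool valid;               /* false means "not cached" */
} toy_entries_t;

/* Hypothetical stand-in for the cache held by an access baton. */
typedef struct {
    toy_entries_t full;       /* includes deleted entries */
    toy_entries_t truncated;  /* deleted entries left out */
    int parse_count;          /* how often the "file" was parsed */
} toy_baton_t;

/* Stand-in for the entries file on disk. */
static toy_entries_t toy_disk;

/* Return the cached hash, parsing the "file" only on a cache miss. */
const toy_entries_t *toy_entries_read(toy_baton_t *baton, bool show_deleted)
{
    if (!baton->full.valid) {
        baton->full = toy_disk;          /* stands in for read + parse */
        baton->full.valid = true;
        baton->parse_count++;
    }
    if (show_deleted)
        return &baton->full;

    if (!baton->truncated.valid) {
        /* Construct the truncated hash from the full hash. */
        baton->truncated.count = 0;
        for (size_t i = 0; i < baton->full.count; i++)
            if (!baton->full.items[i].deleted)
                baton->truncated.items[baton->truncated.count++] =
                    baton->full.items[i];
        baton->truncated.valid = true;
    }
    return &baton->truncated;
}

/* Write ENTRIES to "disk", fill the full cache, clear the truncated one. */
void toy_entries_write(toy_baton_t *baton, const toy_entries_t *entries)
{
    toy_disk = *entries;
    toy_disk.valid = true;
    baton->full = toy_disk;
    baton->truncated.valid = false;
}
```

After a write, a read of either hash is served from the cache without touching the "file", so parse_count stays at zero in that sequence.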
| |
| PROBLEM: memory use is going to be a problem if we simply repeatedly |
| allocate from the access baton pool, as happens at present. As the |
| entries get updated we need to find a way to reuse the cache memory, |
| otherwise memory usage for checkout is going to be proportional to the |
| square of the number of entries in a directory, if each entry update |
| allocates a fresh copy of the whole hash. |
| |
| |
| 4. Access Baton Conversion |
| ----------------------- |
| |
| Given a function |
| svn_error_t *foo (const char *path); |
| if PATH is always a directory then the change that gets made is usually |
| svn_error_t *foo (svn_wc_adm_access_t *adm_access); |
| Within foo, the original const char* can be obtained using |
| const char *svn_wc_adm_access_path(svn_wc_adm_access_t *adm_access); |
| |
| The above case sometimes occurs as |
| svn_error_t *foo (const char *name, const char *dir); |
| where NAME is a single path component, and DIR is a directory. Conversion |
| is again simple in this case |
| svn_error_t *foo (const char *name, svn_wc_adm_access_t *adm_access); |
| |
| The more difficult case is |
| svn_error_t *foo (const char *path); |
| where PATH can be a file or a directory. This occurs a lot in the |
| current code. In the long term these may get converted to |
| svn_error_t *foo (const char *name, svn_wc_adm_access_t *adm_access); |
| where NAME is a single path component. However this involves more |
| changes to the code calling foo than are strictly necessary, so |
| initially they get converted to |
| svn_error_t *foo (const char *path, svn_wc_adm_access_t *adm_access); |
| where PATH is passed unchanged and an additional access baton is |
| passed. This interface is less than ideal, since there is duplicate |
| information in the path and baton, but since it involves fewer changes |
| in the calling code it makes a reasonable intermediate step. |
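
The duplicate information in the intermediate interface can be made concrete with a small sketch. The helper below is hypothetical (toy_dir_baton_t and toy_name_in_baton do not exist in the library); it assumes '/'-separated paths and shows how the single-component NAME could be recovered from PATH and the baton's directory, or the mismatch detected.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for svn_wc_adm_access_t: only the directory path. */
typedef struct {
    const char *path;
} toy_dir_baton_t;

/* Return the single path component of PATH relative to the baton's
   directory, "" if PATH is the directory itself, or NULL if PATH is not
   the directory or an immediate child of it. */
const char *toy_name_in_baton(const toy_dir_baton_t *adm_access,
                              const char *path)
{
    size_t n = strlen(adm_access->path);

    if (strncmp(adm_access->path, path, n) != 0)
        return NULL;                  /* PATH is outside the directory */
    if (path[n] == '\0')
        return path + n;              /* PATH is the directory itself */
    if (path[n] != '/')
        return NULL;                  /* e.g. "wc/AB" against "wc/A" */

    const char *name = path + n + 1;
    if (*name == '\0' || strchr(name, '/') != NULL)
        return NULL;                  /* empty or more than one component */
    return name;
}
```

A function still using the intermediate foo (path, adm_access) form could use a check of this kind to confirm that the two arguments agree before moving to the foo (name, adm_access) form.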
| |
| |
| 5. Logging |
| ------- |
| |
| As well as caching, the other problem that needs to be addressed is the |
| issue of logging. Modifications to the working copy are supposed to |
| use the log file mechanism to ensure that multiple changes that need |
| to be atomic cannot be partially completed. If the individual changes |
| that may need to be logged are all forced to use an access baton, then |
| the access baton may be able to identify when the log file mechanism |
| should be used. Combine this with an access baton state that tracks |
| whether a log file is being run and we may be able to automatically |
| identify those places that are failing to use the log file mechanism. |
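
One way the baton state mentioned above might look, purely as a sketch: the names toy_log_baton_t, toy_log_begin, toy_log_end and toy_record_modification are hypothetical, and the sketch assumes every modification is reported through the baton and should happen while a log is being run.

```c
#include <stdbool.h>

/* Hypothetical log-tracking state carried by an access baton. */
typedef struct {
    bool running_log;   /* true while the log file is being run */
    int unlogged;       /* modifications seen outside a log run */
} toy_log_baton_t;

/* Mark the start of running the log file. */
void toy_log_begin(toy_log_baton_t *baton)
{
    baton->running_log = true;
}

/* Mark the end of running the log file. */
void toy_log_end(toy_log_baton_t *baton)
{
    baton->running_log = false;
}

/* Called by every working copy modification; counts the ones that
   happen while no log file is being run. */
void toy_record_modification(toy_log_baton_t *baton)
{
    if (!baton->running_log)
        baton->unlogged++;
}
```

A non-zero unlogged count after an operation would point at code that modifies the working copy without going through the log file mechanism.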
| |
| |
| 6. Status |
| ------ |
| |
| The access baton is now in place in the majority of places required to |
| implement caching. There is also a partial caching implementation in |
| place. |