"I have a cunning plan"
or
Entries Caching in the Access Batons
0. Preamble
--------
Entries caching appears to be a good idea for making the client and WC
libraries faster. There has been some discussion about this, but how
it could be implemented has never really been written down. I have a
mental picture of how entries caching could work, but the picture is a
little blurred in places, and I'm a bit worried that the dark bit in
the corner (probably my thumb) may be obscuring important details.
The solution being developed as part of issue 749 is to use the
svn_wc_adm_access_t access batons to cache the results of
svn_wc_entries_read, so that the .svn/entries file does not need to be
read and parsed repeatedly.
What makes this hard is that the entries file is currently accessed in
a large number of places in the code. If we attempt to introduce
caching gradually there is a danger that we will mix code that uses
the cache with code that accesses the entries file directly. Such
mixing is not a good idea, as it is possible that the cache and
entries file may get out of sync. Even if we could ensure that each
client operation used the cache consistently (could we do that?) it
would make future development hard, as we would need to ensure that
such consistency didn't break. Introducing caching everywhere in a
single step is better, but the code changes to do it would be
gigantic.
1. Caching Interface
-----------------
The plan is to identify the places where there needs to be an access
baton, and then make all the changes required to pass access batons
around within the code, but without attempting to introduce the
caching code. This is being done in stages. Once the access baton is
in place, I hope that it will then be possible to start using caching
everywhere in a single step.
The basic functions to retrieve entries are svn_wc_entries_read and
svn_wc_entry. The function svn_wc__entries_write is used to update
the entries file on disk. Simple really, only three functions, and
once the access baton gets this far we are more or less done! The
trouble is that these functions are used everywhere, so the batons
have to be passed through a large number of other functions.
The basic caching read interface will consist of svn_wc_entry for a
single entry and svn_wc_entries_read for a hash of all entries, just
as it does now. Initially these functions will work exactly as they
do right now, except they will have gained an additional access baton
parameter. Once the functions support caching then switching caching
on should just involve very localised changes, as the entry interface
is the same with and without caching.
In the longer term it may be that svn_wc_entries_read will be removed
in favour of providing a set of functions that access the underlying
cache, thus allowing the access baton to track changes made. However,
initially I do not think this will be required: if the current code
gets a hash from svn_wc_entries_read and expects it to remain valid,
then that expectation should still apply when caching is implemented.
PROBLEM: I have identified one place in mark_tree where the hash is
retrieved and one entry is extracted and modified. I can work around
this case, but it shows that we really need a more robust interface.
Perhaps svn_wc_entries_read in its current form should be removed, and
replaced by some functions returning const svn_wc_entry_t pointers.
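To make the const-pointer idea concrete, here is a minimal sketch. All
names and types in it (ro_entry_t, ro_baton_t, sketch_entry_get,
sketch_entry_set_schedule) are hypothetical stand-ins, not existing
libsvn_wc interfaces. Reads hand out const pointers, so a caller cannot
quietly modify a cached entry, and writes go through a function that
lets the baton see each change.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; not the real libsvn_wc types. */
typedef struct ro_entry_t
{
  const char *name;
  int schedule;
} ro_entry_t;

typedef struct ro_baton_t
{
  ro_entry_t entries[4];
  int count;
  int modifications;   /* how many changes the baton has seen */
} ro_baton_t;

/* Read access: callers get a const pointer and cannot write through it. */
const ro_entry_t *
sketch_entry_get (const ro_baton_t *baton, const char *name)
{
  int i;

  assert (baton != NULL && name != NULL);
  for (i = 0; i < baton->count; i++)
    if (strcmp (baton->entries[i].name, name) == 0)
      return &baton->entries[i];
  return NULL;
}

/* Write access: every change goes through the baton, which can track it.
   Returns 1 if NAME was found and updated, 0 otherwise. */
int
sketch_entry_set_schedule (ro_baton_t *baton, const char *name, int schedule)
{
  int i;

  for (i = 0; i < baton->count; i++)
    if (strcmp (baton->entries[i].name, name) == 0)
      {
        baton->entries[i].schedule = schedule;
        baton->modifications++;
        return 1;
      }
  return 0;
}
```

With an interface of this shape, code like the mark_tree case would fail
to compile if it tried to assign through the pointer it got back.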
At present access batons have a fairly strict interface: they must be
passed directory names, and the code always "knows" whether it is
supposed to have a baton for a particular directory or not (and thus
it knows whether to call svn_wc_adm_open or svn_wc_adm_retrieve). One
tricky point is that svn_wc_entry is often called first, before any
access batons are opened, to determine if a given path represents a
versioned file or a versioned directory. However svn_wc_entry falls
back on checking the physical working copy, so this functionality will
probably be copied or moved into an access baton convenience function
that allows opening an access baton without requiring knowledge of
whether the path is a file or a directory.
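As a rough sketch of what such a convenience function has to decide,
the code below works out which directory a baton should be opened on
for an arbitrary path. The name sketch_probe_dir is hypothetical, and
the check uses only POSIX stat on the physical working copy; consulting
the entries file first, as svn_wc_entry does, is left out.

```c
#include <assert.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: copy into BUF the directory on which a baton
   would be opened for PATH.  If PATH is an existing directory that is
   PATH itself, otherwise it is PATH's parent.  Returns 0 on success,
   -1 if BUF is too small. */
int
sketch_probe_dir (const char *path, char *buf, size_t bufsize)
{
  struct stat st;
  const char *slash;
  size_t len;

  assert (path != NULL && buf != NULL);

  if (stat (path, &st) == 0 && S_ISDIR (st.st_mode))
    len = strlen (path);
  else
    {
      slash = strrchr (path, '/');
      if (slash == NULL)
        {
          /* A bare name: its parent is the current directory. */
          path = ".";
          len = 1;
        }
      else if (slash == path)
        len = 1;                         /* parent is the root, "/" */
      else
        len = (size_t) (slash - path);   /* everything before the slash */
    }

  if (len + 1 > bufsize)
    return -1;
  memcpy (buf, path, len);
  buf[len] = '\0';
  return 0;
}
```

The caller no longer needs to know in advance whether it holds a file
or a directory path; it opens the baton on whatever directory comes
back.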
The basic caching write interface is svn_wc__entries_write. Initially
this will write directly to the entries file, just as it currently
does. Later on, modifications may be cached until an explicit
entries_flush call is made. I haven't yet determined whether this
would be a significant benefit in terms of speed, or whether it would
risk losing changes if a process is interrupted.
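The trade-off can be seen in a small sketch of deferred writes. The
names here (flush_baton_t, sketch_entry_set, sketch_entries_flush) are
made up for illustration, and a single integer stands in for the whole
set of entries.

```c
#include <assert.h>

/* Hypothetical baton holding one cached value and its on-disk copy. */
typedef struct flush_baton_t
{
  int cached_revision;   /* stands in for the cached entries */
  int disk_revision;     /* stands in for the entries file */
  int dirty;             /* nonzero if the cache is newer than the disk */
  int write_count;       /* how many times the "file" was written */
} flush_baton_t;

/* Modify the cache only; nothing reaches the disk yet. */
void
sketch_entry_set (flush_baton_t *baton, int revision)
{
  assert (baton != 0);
  baton->cached_revision = revision;
  baton->dirty = 1;
}

/* Write the cache out once, however many modifications were made. */
void
sketch_entries_flush (flush_baton_t *baton)
{
  assert (baton != 0);
  if (!baton->dirty)
    return;
  baton->disk_revision = baton->cached_revision;
  baton->write_count++;
  baton->dirty = 0;
}
```

Several modifications cost a single write, but anything modified since
the last flush exists only in memory, which is exactly the
interruption risk mentioned above.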
The function svn_wc__entry_modify is written in terms of entries_read
and entries_write and has already been converted to take an access
baton.
2. Access Baton Sets
-----------------
Each access baton represents a directory. Access batons can associate
together in sets. Given an access baton in a set, it is possible to
retrieve any other access baton in the set. When an access baton in a
set is closed, all other access batons in the set that represent
subdirectories are also closed. The set is implemented as a hash
table "owned" by one baton in the set, but shared by all batons in
the set.
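The set behaviour can be sketched as follows. This is a simplified
model with hypothetical names: a fixed-size array stands in for the
shared hash table, the set is a separate object rather than being
owned by one of the batons, and a closed baton is only flagged as
closed rather than freed.

```c
#include <assert.h>
#include <string.h>

#define SET_MAX 8

typedef struct set_baton_t set_baton_t;

/* Stands in for the hash table shared by all batons in a set. */
typedef struct baton_set_t
{
  set_baton_t *members[SET_MAX];
  int count;
} baton_set_t;

struct set_baton_t
{
  const char *path;    /* the directory this baton represents */
  int open;
  baton_set_t *set;
};

/* Open BATON on PATH and add it to SET. */
void
sketch_adm_open (set_baton_t *baton, baton_set_t *set, const char *path)
{
  assert (set->count < SET_MAX);
  baton->path = path;
  baton->open = 1;
  baton->set = set;
  set->members[set->count++] = baton;
}

/* Given any baton in a set, find the open baton for PATH, or NULL. */
set_baton_t *
sketch_adm_retrieve (set_baton_t *associated, const char *path)
{
  int i;

  for (i = 0; i < associated->set->count; i++)
    {
      set_baton_t *member = associated->set->members[i];
      if (member->open && strcmp (member->path, path) == 0)
        return member;
    }
  return NULL;
}

/* Return nonzero if PATH lies below the directory PARENT. */
static int
is_subdir (const char *parent, const char *path)
{
  size_t len = strlen (parent);
  return strncmp (parent, path, len) == 0 && path[len] == '/';
}

/* Close BATON and every baton in its set representing a subdirectory. */
void
sketch_adm_close (set_baton_t *baton)
{
  int i;

  for (i = 0; i < baton->set->count; i++)
    {
      set_baton_t *member = baton->set->members[i];
      if (member->open
          && (member == baton || is_subdir (baton->path, member->path)))
        member->open = 0;
    }
}
```

Closing the baton for "wc/sub" in this model also closes
"wc/sub/deep", while the baton for "wc" stays open.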
At present in the code, access batons are opened in a parent->child
order. This works well with the shared hash being owned by the first
baton in each set. There is code to detect if closing a baton will
destroy the hash while other batons are using it; as far as I know it
doesn't currently trigger. If it turns out that this needs to be
supported it should be possible to transfer the hash information to
another baton.
3. Caching Mechanism
-----------------
Each access baton will cache the two possible hashes returned by
svn_wc_entries_read, so that subsequent calls will not need to parse
the entries file. If the full hash, the one containing deleted
entries, is available when a request for the truncated hash is made,
then the truncated hash will be constructed from the full hash.
The function svn_wc__entries_write will cause the full hash cache to
be filled and the truncated hash cache to be cleared.
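A minimal sketch of this mechanism follows. The types and function
names are hypothetical, fixed-size arrays stand in for the hashes, and
a struct in memory stands in for the entries file. As a simplification,
the sketch always fills the full hash first and derives the truncated
hash from it.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ENTRIES 16

/* Hypothetical, simplified stand-ins; not the real libsvn_wc types. */
typedef struct sketch_entry_t
{
  const char *name;
  int deleted;               /* nonzero if the entry is marked deleted */
} sketch_entry_t;

typedef struct sketch_entries_t
{
  sketch_entry_t items[MAX_ENTRIES];
  size_t count;
} sketch_entries_t;

typedef struct sketch_baton_t
{
  sketch_entries_t full;        /* cache including deleted entries */
  sketch_entries_t truncated;   /* cache excluding deleted entries */
  int have_full;
  int have_truncated;
  int parse_count;              /* how often the "file" was parsed */
  sketch_entries_t *disk;       /* stands in for .svn/entries */
} sketch_baton_t;

/* Return the requested hash, parsing the "file" only on a cache miss. */
const sketch_entries_t *
sketch_entries_read (sketch_baton_t *baton, int show_deleted)
{
  size_t i;

  if (!baton->have_full)
    {
      baton->full = *baton->disk;      /* stands in for parsing */
      baton->have_full = 1;
      baton->parse_count++;
    }
  if (show_deleted)
    return &baton->full;

  if (!baton->have_truncated)
    {
      /* Build the truncated hash from the full one. */
      baton->truncated.count = 0;
      for (i = 0; i < baton->full.count; i++)
        if (!baton->full.items[i].deleted)
          baton->truncated.items[baton->truncated.count++]
            = baton->full.items[i];
      baton->have_truncated = 1;
    }
  return &baton->truncated;
}

/* Write ENTRIES out: fill the full cache, clear the truncated one. */
void
sketch_entries_write (sketch_baton_t *baton, const sketch_entries_t *entries)
{
  assert (entries->count <= MAX_ENTRIES);
  *baton->disk = *entries;
  baton->full = *entries;
  baton->have_full = 1;
  baton->have_truncated = 0;
}
```

However many reads follow, the file is parsed at most once per baton,
and the first truncated read after a write rebuilds the truncated hash
from the freshly filled full cache.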
PROBLEM: memory use is going to be a problem if we simply repeatedly
allocate from the access baton pool as happens at present. As the
entries get updated we need to find a way to reuse the cache memory,
otherwise memory usage for checkout is going to be proportional to the
number of entries in a directory.
4. Access Baton Conversion
-----------------------
Given a function
svn_error_t *foo (const char *path);
if PATH is always a directory then the change that gets made is usually
svn_error_t *foo (svn_wc_adm_access_t *adm_access);
Within foo, the original const char* can be obtained using
const char *svn_wc_adm_access_path(svn_wc_adm_access_t *adm_access);
The above case sometimes occurs as
svn_error_t *foo (const char *name, const char *dir);
where NAME is a single path component, and DIR is a directory. Conversion
is again simple in this case
svn_error_t *foo (const char *name, svn_wc_adm_access_t *adm_access);
The more difficult case is
svn_error_t *foo (const char *path);
where PATH can be a file or a directory. This occurs a lot in the
current code. In the long term these may get converted to
svn_error_t *foo (const char *name, svn_wc_adm_access_t *adm_access);
where NAME is a single path component. However this involves more
changes to the code calling foo than are strictly necessary, so
initially they get converted to
svn_error_t *foo (const char *path, svn_wc_adm_access_t *adm_access);
where PATH is passed unchanged and an additional access baton is
passed. This interface is less than ideal, since there is duplicate
information in the path and baton, but since it involves fewer changes
in the calling code it makes a reasonable intermediate step.
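The duplication in the intermediate form can be illustrated with a
small sketch. The types and functions below are hypothetical; in
particular sketch_adm_access_path only mimics the role of
svn_wc_adm_access_path. Given PATH and a baton, the single-component
NAME that the long-term interface would take can be derived, which
shows that PATH carries the baton's directory a second time.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical baton that only remembers its directory. */
typedef struct conv_baton_t
{
  const char *path;
} conv_baton_t;

/* Plays the role of svn_wc_adm_access_path in this sketch. */
const char *
sketch_adm_access_path (const conv_baton_t *adm_access)
{
  return adm_access->path;
}

/* Return the single path component of PATH below the baton's
   directory, "" if PATH is that directory itself, or NULL if PATH is
   not directly covered by the baton. */
const char *
sketch_name_in_baton (const char *path, const conv_baton_t *adm_access)
{
  const char *dir = sketch_adm_access_path (adm_access);
  size_t len = strlen (dir);

  assert (path != NULL);
  if (strcmp (path, dir) == 0)
    return "";
  if (strncmp (path, dir, len) == 0
      && path[len] == '/'
      && path[len + 1] != '\0'
      && strchr (path + len + 1, '/') == NULL)
    return path + len + 1;
  return NULL;
}
```

A function in the intermediate form could use a check like this to
verify that the PATH and baton it was handed actually agree.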
5. Logging
-------
As well as caching, the other problem that needs to be addressed is the
issue of logging. Modifications to the working copy are supposed to
use the log file mechanism to ensure that multiple changes that need
to be atomic cannot be partially completed. If the individual changes
that may need to be logged are all forced to use an access baton, then
the access baton may be able to identify when the log file mechanism
should be used. Combine this with an access baton state that tracks
whether a log file is being run and we may be able to automatically
identify those places that are failing to use the log file mechanism.
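One way this detection might look is sketched below. Everything here
is hypothetical: the baton carries a flag saying whether a log file is
being run, and each modification made through the baton is counted as
either logged or unlogged.

```c
#include <assert.h>

/* Hypothetical baton state for spotting unlogged modifications. */
typedef struct log_baton_t
{
  int running_log;        /* nonzero while a log file is being run */
  int logged_changes;
  int unlogged_changes;   /* modifications made outside any log run */
} log_baton_t;

void
sketch_log_begin (log_baton_t *baton)
{
  assert (!baton->running_log);
  baton->running_log = 1;
}

void
sketch_log_end (log_baton_t *baton)
{
  assert (baton->running_log);
  baton->running_log = 0;
}

/* Every working copy modification is funnelled through the baton,
   which records whether it happened under the log mechanism. */
void
sketch_modify (log_baton_t *baton)
{
  if (baton->running_log)
    baton->logged_changes++;
  else
    baton->unlogged_changes++;
}
```

A nonzero unlogged_changes count after an operation would point at a
code path that bypasses the log file mechanism.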
6. Status
------
The access baton is now in place in the majority of places required to
implement caching. There is also a partial caching implementation in
place.