

                     "I have a cunning plan"

                              or

              Entries Caching in the Access Batons


 0. Preamble
    --------

 Entries caching appears to be a good idea for making the client and WC
 libraries faster.  There has been some discussion about this, but how
 it could be implemented has never really been written down.  I have a
 mental picture of how entries caching could work, but the picture is a
 little blurred in places, and I'm a bit worried that the dark bit in
 the corner (probably my thumb) may be obscuring important details.

 The solution being developed as part of issue 749 is to use the
 svn_wc_adm_access_t access batons to cache the results of
 svn_wc_entries_read, so that the .svn/entries file does not need to
 be read and parsed repeatedly.

 What makes this hard is that the entries file is currently accessed in
 a large number of places in the code.  If we attempt to introduce
 caching gradually there is a danger that we will mix code that uses
 the cache with code that accesses the entries file directly.  Such
 mixing is not a good idea, as it is possible that the cache and
 entries file may get out of sync.  Even if we could ensure that each
 client operation used the cache consistently (could we do that?)  it
 would make future development hard, as we would need to ensure that
 such consistency didn't break.  Introducing caching everywhere in a
 single step is better, but the code changes to do it would be
 gigantic.


 1. Caching Interface
    -----------------

 The plan is to identify the places where there needs to be an access
 baton, and then make all the changes required to pass access batons
 around within the code, but without attempting to introduce the
 caching code.  This is being done in stages.  Once the access baton is
 in place, I hope that it will then be possible to start using caching
 everywhere in a single step.

 The basic functions to retrieve entries are svn_wc_entries_read and
 svn_wc_entry.  The function svn_wc__entries_write is used to update
 the entries file on disk.  Simple really, only three functions, and
 once the access baton gets this far we are more or less done!  The
 trouble is that these functions are used everywhere, so the batons
 have to be passed through a large number of other functions.

 The basic caching read interface will consist of svn_wc_entry for a
 single entry and svn_wc_entries_read for a hash of all entries, just
 as it does now.  Initially these functions will work exactly as they
 do right now, except they will have gained an additional access baton
 parameter.  Once the functions support caching then switching caching
 on should just involve very localised changes, as the entry interface
 is the same with and without caching.

 In the longer term it may be that svn_wc_entries_read will be removed
 in favour of providing a set of functions that access the underlying
 cache, thus allowing the access baton to track changes made.
 Initially, however, I do not think this will be required: if the
 current code gets a hash from svn_wc_entries_read and expects it to
 remain valid, then that expectation should still hold when caching is
 implemented.
 PROBLEM: I have identified one place in mark_tree where the hash is
 retrieved, then one entry is extracted and modified.  I can work
 around this case, but it shows that we really need a more robust
 interface.
 Perhaps svn_wc_entries_read in its current form should be removed, and
 replaced by some functions returning const svn_wc_entry_t pointers.
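
 To make the const-pointer idea concrete, here is a minimal sketch.
 It is not Subversion code: toy_entry_t, toy_baton_t and the toy_*
 functions are hypothetical stand-ins invented for this example.
 Readers get a const pointer into the cache, and the only way to
 change an entry is through a function that lets the baton notice the
 change.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_ENTRIES 8

/* Toy stand-in for svn_wc_entry_t; the real structure has more fields. */
typedef struct toy_entry_t
{
  char name[32];
  long revision;
} toy_entry_t;

/* Toy stand-in for an access baton that owns the cached entries. */
typedef struct toy_baton_t
{
  toy_entry_t entries[TOY_MAX_ENTRIES];
  size_t count;
  int modified;   /* set whenever the cache changes via the baton */
} toy_baton_t;

static toy_entry_t *
find_entry (toy_baton_t *baton, const char *name)
{
  size_t i;

  for (i = 0; i < baton->count; i++)
    if (strcmp (baton->entries[i].name, name) == 0)
      return &baton->entries[i];
  return NULL;
}

/* Add an entry to the cache.  Returns 0 on success, -1 if it is full. */
int
toy_entry_add (toy_baton_t *baton, const char *name, long revision)
{
  toy_entry_t *entry;

  assert (strlen (name) < sizeof (entry->name));
  if (baton->count == TOY_MAX_ENTRIES)
    return -1;
  entry = &baton->entries[baton->count++];
  strcpy (entry->name, name);
  entry->revision = revision;
  baton->modified = 1;
  return 0;
}

/* Read access: the caller cannot modify the entry through this pointer. */
const toy_entry_t *
toy_entry_get (toy_baton_t *baton, const char *name)
{
  return find_entry (baton, name);
}

/* Write access: the baton records that its cache has changed.
   Returns 0 on success, -1 if there is no such entry. */
int
toy_entry_set_revision (toy_baton_t *baton, const char *name, long revision)
{
  toy_entry_t *entry = find_entry (baton, name);

  if (entry == NULL)
    return -1;
  entry->revision = revision;
  baton->modified = 1;
  return 0;
}
```

 With an interface of this shape the mark_tree case would have to go
 through something like toy_entry_set_revision, so the baton would
 always know when its cache and the entries file differ.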

 At present access batons have a fairly strict interface: they must be
 passed directory names, and the code always "knows" whether it is
 supposed to have a baton for a particular directory or not (and thus
 it knows whether to call svn_wc_adm_open or svn_wc_adm_retrieve).  One
 tricky point is that svn_wc_entry is often called first, before any
 access batons are opened, to determine if a given path represents a
 versioned file or a versioned directory.  However svn_wc_entry falls
 back on checking the physical working copy, so this functionality will
 probably be copied or moved into an access baton convenience function
 that allows opening an access baton without requiring knowledge of
 whether the path is a file or a directory.

 The basic caching write interface is svn_wc__entries_write.  Initially
 this will write directly to the entries file, just as it currently
 does.  Later on, modifications may be cached until an explicit
 entries_flush call is made.  I haven't yet determined whether this
 would be a significant benefit in terms of speed, or whether it would
 risk losing changes if a process is interrupted.

 The function svn_wc__entry_modify is written in terms of entries_read
 and entries_write and has already been converted to take an access
 baton.


 2. Access Baton Sets
    -----------------

 Each access baton represents a directory.  Access batons can associate
 together in sets.  Given an access baton in a set, it is possible to
 retrieve any other access baton in the set.  When an access baton in a
 set is closed, all other access batons in the set that represent
 subdirectories are also closed.  The set is implemented as a hash
 table "owned" by one baton in any set, but shared by all batons in
 the set.

 At present in the code, access batons are opened in a parent->child
 order.  This works well with the shared hash being owned by the first
 baton in each set.  There is code to detect if closing a baton will
 destroy the hash while other batons are using it, but as far as I know
 it doesn't currently trigger.  If it turns out that this needs to be
 supported it should be possible to transfer the hash information to
 another baton.
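
 The set behaviour described in this section can be sketched as
 follows.  This is a simplified model, not the real implementation:
 a fixed-size array replaces the hash table, paths are compared as
 plain strings, and the toy_* names are hypothetical (they only echo
 svn_wc_adm_open and svn_wc_adm_retrieve).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_SET_MAX 16

typedef struct toy_baton_t toy_baton_t;

/* The shared set; an array stands in for the hash table. */
typedef struct toy_set_t
{
  toy_baton_t *members[TOY_SET_MAX];
  size_t count;
} toy_set_t;

struct toy_baton_t
{
  const char *path;   /* the directory this baton represents */
  toy_set_t *set;     /* shared by every baton in the set */
  toy_set_t own_set;  /* storage, used only by the owning baton */
  int is_open;
};

/* Open BATON for PATH.  If ASSOCIATED is NULL, BATON starts (and owns)
   a new set, otherwise it joins ASSOCIATED's set. */
int
toy_adm_open (toy_baton_t *baton, toy_baton_t *associated, const char *path)
{
  assert (baton != NULL && path != NULL);
  baton->path = path;
  baton->own_set.count = 0;
  baton->set = associated ? associated->set : &baton->own_set;
  if (baton->set->count == TOY_SET_MAX)
    return -1;
  baton->set->members[baton->set->count++] = baton;
  baton->is_open = 1;
  return 0;
}

/* Given any baton in a set, find the open baton for PATH, or NULL. */
toy_baton_t *
toy_adm_retrieve (toy_baton_t *associated, const char *path)
{
  size_t i;

  for (i = 0; i < associated->set->count; i++)
    if (strcmp (associated->set->members[i]->path, path) == 0)
      return associated->set->members[i];
  return NULL;
}

static int
is_same_or_child (const char *parent, const char *path)
{
  size_t len = strlen (parent);

  if (strncmp (parent, path, len) != 0)
    return 0;
  return path[len] == '\0' || path[len] == '/';
}

/* Close BATON and every baton in the set below it.  Returns -1 if
   BATON owns the set while other batons are still using it. */
int
toy_adm_close (toy_baton_t *baton)
{
  toy_set_t *set = baton->set;
  size_t i, kept = 0;

  for (i = 0; i < set->count; i++)
    {
      toy_baton_t *member = set->members[i];

      if (is_same_or_child (baton->path, member->path))
        member->is_open = 0;
      else
        set->members[kept++] = member;
    }
  set->count = kept;
  return (set == &baton->own_set && kept > 0) ? -1 : 0;
}
```

 When batons are opened parent first, closing the owner also closes
 everything else in the set, so the -1 case never arises.  It only
 shows up if the first baton opened is not an ancestor of the others.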

 3. Caching Mechanism
    -----------------

 Each access baton will cache the two possible hashes returned by
 svn_wc_entries_read, so that subsequent calls will not need to parse
 the entries file.  If the full hash, the one containing deleted
 entries, is available when a request for the truncated hash is made,
 then the truncated hash will be constructed from the full hash.

 The function svn_wc__entries_write will cause the full hash cache to
 be filled and the truncated hash cache to be cleared.
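
 The interplay between the two caches can be sketched like this.
 Again this is a toy model with hypothetical names: fixed-size arrays
 replace the hashes, and a third array stands in for the entries file
 so that "parsing" can be counted.  One simplification is mine: a
 request for the truncated list with nothing cached parses into the
 full cache first and then filters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX 8

typedef struct toy_entry_t
{
  const char *name;
  int deleted;
} toy_entry_t;

typedef struct toy_list_t
{
  toy_entry_t items[TOY_MAX];
  size_t count;
  int valid;            /* non-zero if this cache can be used */
} toy_list_t;

typedef struct toy_baton_t
{
  toy_list_t disk;       /* stands in for the entries file */
  toy_list_t full;       /* cache including deleted entries */
  toy_list_t truncated;  /* cache without deleted entries */
  int parse_count;       /* how often the "file" was parsed */
} toy_baton_t;

/* Return the cached entries, parsing the "file" only if needed. */
const toy_list_t *
toy_entries_read (toy_baton_t *baton, int show_deleted)
{
  size_t i;

  if (! baton->full.valid)
    {
      baton->full = baton->disk;
      baton->full.valid = 1;
      baton->parse_count++;
    }
  if (show_deleted)
    return &baton->full;

  if (! baton->truncated.valid)
    {
      /* Build the truncated cache from the full one. */
      baton->truncated.count = 0;
      for (i = 0; i < baton->full.count; i++)
        if (! baton->full.items[i].deleted)
          baton->truncated.items[baton->truncated.count++]
            = baton->full.items[i];
      baton->truncated.valid = 1;
    }
  return &baton->truncated;
}

/* Write ITEMS to the "file": fill the full cache, clear the truncated. */
void
toy_entries_write (toy_baton_t *baton, const toy_entry_t *items,
                   size_t count)
{
  assert (count <= TOY_MAX);
  memcpy (baton->disk.items, items, count * sizeof (*items));
  baton->disk.count = count;
  baton->full = baton->disk;
  baton->full.valid = 1;
  baton->truncated.valid = 0;
}
```

 After a write, the next request for the truncated list rebuilds it
 from the full cache, so the "file" is never parsed again for the
 lifetime of the baton in this model.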

 PROBLEM: memory use is going to be a problem if we simply repeatedly
 allocate from the access baton pool as happens at present.  As the
 entries get updated we need to find a way to reuse the cache memory,
 otherwise memory usage for checkout is going to be proportional to the
 number of entries in a directory.


 4. Access Baton Conversion
    -----------------------

 Given a function
   svn_error_t *foo (const char *path);
 if PATH is always a directory then the change that gets made is usually
   svn_error_t *foo (svn_wc_adm_access_t *adm_access);
 Within foo, the original const char* can be obtained using
   const char *svn_wc_adm_access_path(svn_wc_adm_access_t *adm_access);

 The above case sometimes occurs as
   svn_error_t *foo (const char *name, const char *dir);
 where NAME is a single path component, and DIR is a directory.  Conversion
 is again simple in this case
   svn_error_t *foo (const char *name, svn_wc_adm_access_t *adm_access);

 The more difficult case is
   svn_error_t *foo (const char *path);
 where PATH can be a file or a directory.  This occurs a lot in the
 current code. In the long term these may get converted to
   svn_error_t *foo (const char *name, svn_wc_adm_access_t *adm_access);
 where NAME is a single path component.  However this involves more
 changes to the code calling foo than are strictly necessary, so
 initially they get converted to
   svn_error_t *foo (const char *path, svn_wc_adm_access_t *adm_access);
 where PATH is passed unchanged and an additional access baton is
 passed.  This interface is less than ideal, since there is duplicate
 information in the path and baton, but since it involves fewer changes
 in the calling code it makes a reasonable intermediate step.
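
 The duplication in the intermediate form can be seen in a small
 sketch.  Given PATH and a baton, the NAME that the long-term
 interface would take can be derived from the two, which also gives a
 way to check that they agree.  The names below are hypothetical, the
 baton is reduced to its path, and using the empty string for the
 directory itself is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for an access baton: only the directory path. */
typedef struct toy_baton_t
{
  const char *path;
} toy_baton_t;

/* Return the entry name of PATH within ADM_ACCESS's directory: ""
   if PATH is the directory itself, the final path component if PATH
   is an immediate child, or NULL if PATH and the baton disagree. */
const char *
toy_name_in_baton (const char *path, const toy_baton_t *adm_access)
{
  size_t len;

  assert (path != NULL && adm_access != NULL);
  len = strlen (adm_access->path);
  if (strncmp (path, adm_access->path, len) != 0)
    return NULL;
  if (path[len] == '\0')
    return "";
  if (path[len] != '/' || path[len + 1] == '\0'
      || strchr (path + len + 1, '/') != NULL)
    return NULL;
  return path + len + 1;
}
```

 A function in the intermediate form could call a helper like this on
 entry, which would make a later move to the NAME-plus-baton form a
 mechanical change.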


 5. Logging
    -------

 As well as caching, the other problem that needs to be addressed is the
 issue of logging.  Modifications to the working copy are supposed to
 use the log file mechanism to ensure that multiple changes that need
 to be atomic cannot be partially completed.  If the individual changes
 that may need to be logged are all forced to use an access baton, then
 the access baton may be able to identify when the log file mechanism
 should be used.  Combine this with an access baton state that tracks
 whether a log file is being run and we may be able to automatically
 identify those places that are failing to use the log file mechanism.
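
 A minimal sketch of that idea, with hypothetical names throughout:
 the baton carries a flag saying whether a log file is being run, and
 every change that should be atomic is checked against it.  How the
 real code would decide that a change needs the log is left open
 here; the sketch simply takes it as a parameter.

```c
#include <assert.h>

typedef struct toy_baton_t
{
  int log_running;       /* non-zero while a log file is being run */
  int unlogged_changes;  /* changes that bypassed the log mechanism */
} toy_baton_t;

void
toy_log_begin (toy_baton_t *baton)
{
  assert (! baton->log_running);
  baton->log_running = 1;
}

void
toy_log_end (toy_baton_t *baton)
{
  assert (baton->log_running);
  baton->log_running = 0;
}

/* Record a change made through BATON.  NEEDS_LOG is non-zero if the
   change is one of a group that must be atomic.  Returns 0 if the
   change is acceptable, -1 if it should have gone through the log. */
int
toy_modify (toy_baton_t *baton, int needs_log)
{
  if (needs_log && ! baton->log_running)
    {
      baton->unlogged_changes++;
      return -1;
    }
  return 0;
}
```

 Running the test suite with a check like this enabled would turn
 each place that fails to use the log file mechanism into a counted,
 reportable event.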


 6. Status
    ------

 The access baton is now in place in the majority of places required to
 implement caching.  There is also a partial caching implementation in
 place.