

                     "I have a cunning plan"

                              or

              Entries Caching in the Access Batons


 0. Preamble
    --------

 Issue 749 provides some history.  The access batons now cache the
 parsed entries file, as repeatedly reading, parsing and writing the
 file proved to be a bottleneck.


 1. Caching Interface
    -----------------

 The basic functions to retrieve entries are svn_wc_entries_read and
 svn_wc_entry.  The function svn_wc__entries_write is used to update
 the entries file on disk.  The function svn_wc__entry_modify is
 implemented in terms of entries_read and entries_write.

 Caching during read-only operations is essentially complete.  For
 read-write operations one potential bottleneck remains: calling
 svn_wc__entries_write repeatedly, usually through the function
 svn_wc__entry_modify.  This writes the entire file even if only one
 entry has changed.  A partial solution is in place: when running a log
 file any changes are accumulated in the access baton and only finally
 written when the log file completes.  There is scope to use the same
 mechanism in other places, although it may be best to refrain from
 doing this until the interface changes (see section 2) are resolved.
 A more substantial enhancement is described below.

 1.1 Write Caching Enhancement

 An overview of the update process:

    1. Lock the directory
    2. Read the entries file and cache in memory
    3. Start the wc update
       3.1. Receive an update to a file
       3.2. Write a log file
       3.3. Start running log file
          3.3.1. Modify entries in memory
       3.4. Finish log file
       3.5. Flush entries to disk
       3.6. Remove log file
    4. Finish update
    5. Unlock directory

 Each log file may make multiple modifications to the entries file
 (step 3.3.1 above), and each wc update may contain updates for
 multiple files (step 3 above).  Multiple modifications during a log
 file are cached in memory and so are not a problem, but multiple files
 result in multiple writes of the entries file (step 3.5 above).
 Writing the entries file is a non-trivial operation, and doing it
 repeatedly means that the run-time increases non-linearly as the
 number of entries in the directory increases.

 Simply caching the entries modifications in memory for the complete
 update is not possible, because it would break the atomicity
 guarantees provided by the log file system.  When the log file is
 removed, all entries modifications need to be present on disk;
 otherwise an interrupt will result in an inconsistent working copy.

 One solution is to track which entries get changed while running a log
 file, and write a temporary, partial, entries file instead of the
 complete file at the end of each log file (step 3.5 above). Then there
 would need to be a write of the complete file only once at the end of
 the update (step 4 above).  With this system, if the update is
 interrupted the cleanup mechanism would recognise the partial entries
 files and combine them with the out-of-date entries file on disk.  The
 temporary, partial entries files need to be written atomically to
 avoid problems with incomplete writes, and they need to be ordered so
 that they can be read in the order they were written.  They also need
 to be cleaned up before removing the lock.  The mechanism to track
 modified entries would mark them all unmodified as each temporary,
 partial, entries file is written.  Note: during a normal update the
 temporary, partial, entries files do not need to be read.
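
 The partial-file scheme might be modelled roughly as follows.  This
 is only a sketch under stated assumptions: the toy_* names and types
 are hypothetical, an entry is reduced to a name and a revision, and
 an in-memory array stands in for each temporary, partial, entries
 file.  Nothing here is existing libsvn_wc code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much simplified cached entry. */
typedef struct {
  const char *name;
  long revision;
  int modified;         /* non-zero if changed since the last flush */
} toy_cached_entry_t;

/* One record of a (simulated) partial entries file. */
typedef struct {
  const char *name;
  long revision;
} toy_partial_record_t;

/* Step 3.3.1: modify an entry in memory and mark it modified.
   Returns 0 on success, -1 if NAME is not in the cache. */
int
toy_entry_modify (toy_cached_entry_t *cache, size_t count,
                  const char *name, long revision)
{
  size_t i;

  assert (cache != NULL && name != NULL);
  for (i = 0; i < count; i++)
    if (strcmp (cache[i].name, name) == 0)
      {
        cache[i].revision = revision;
        cache[i].modified = 1;
        return 0;
      }
  return -1;
}

/* Step 3.5: write only the modified entries to OUT (which must have
   room for COUNT records) and mark them all unmodified again.
   Returns the number of records written. */
size_t
toy_flush_partial (toy_cached_entry_t *cache, size_t count,
                   toy_partial_record_t *out)
{
  size_t i, written = 0;

  for (i = 0; i < count; i++)
    if (cache[i].modified)
      {
        out[written].name = cache[i].name;
        out[written].revision = cache[i].revision;
        written++;
        cache[i].modified = 0;
      }
  return written;
}

/* Cleanup after an interrupt: apply one partial file to the stale
   entries read from disk.  Call once per partial file, in the order
   the files were written.  The merged entries end up marked modified,
   since they still have to be written out as a complete file. */
void
toy_apply_partial (toy_cached_entry_t *stale, size_t count,
                   const toy_partial_record_t *partial, size_t records)
{
  size_t i;

  for (i = 0; i < records; i++)
    toy_entry_modify (stale, count, partial[i].name, partial[i].revision);
}
```

 During a normal update only toy_flush_partial would run after each
 log file; toy_apply_partial corresponds to the cleanup path.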


 2. Interface Enhancements
    ----------------------

 2.1 Entries Interface

 A lot of the entries interface has remained unchanged since the
 pre-caching days, and it shows.  Of particular concern is the
 svn_wc_entries_read function, as this provides access to the raw data
 within the cache.  If the application carelessly modifies the data
 things may go wrong.  I would like to remove this function.

 One use of svn_wc_entries_read is in svn_wc__entry_modify; this is
 "within the entries code" and so is not a problem.

 Of the other uses of svn_wc_entries_read the most common is where the
 application wants to iterate over all the entries in a directory. I
 would like to see an interface something like

   typedef struct svn_wc_entry_iterator_t svn_wc_entry_iterator_t;

   svn_wc_entry_iterator_t *
   svn_wc_entry_first(svn_wc_adm_access_t *adm_access,
                      apr_pool_t *pool);

   svn_wc_entry_iterator_t *
   svn_wc_entry_next(svn_wc_entry_iterator_t *entry_iterator);

   const svn_wc_entry_t *
   svn_wc_entry_iterator_entry(svn_wc_entry_iterator_t *entry_iterator);

 Note that this provides only const access to the entries, so the
 application cannot modify the cached data.  All modifications would go
 through svn_wc__entry_modify, and the access batons could keep track
 of whether modifications have been made and not yet written to disk.
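
 To show how a caller might use such an iterator, here is a toy,
 self-contained model of it.  Everything named toy_* is hypothetical.
 It uses malloc/free where the real proposal takes an apr_pool_t, and
 it assumes the first/next functions return NULL when there are no
 (more) entries, which the proposal above does not actually specify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
  const char *name;
  long revision;
} toy_entry_t;

/* Stand-in for an access baton holding cached entries. */
typedef struct {
  const toy_entry_t *entries;
  size_t count;
} toy_access_t;

typedef struct {
  const toy_access_t *adm_access;
  size_t index;
} toy_entry_iterator_t;

/* Return an iterator positioned at the first entry, or NULL if there
   are no entries (or allocation fails). */
toy_entry_iterator_t *
toy_entry_first (const toy_access_t *adm_access)
{
  toy_entry_iterator_t *it;

  assert (adm_access != NULL);
  if (adm_access->count == 0)
    return NULL;
  it = malloc (sizeof (*it));
  if (it == NULL)
    return NULL;
  it->adm_access = adm_access;
  it->index = 0;
  return it;
}

/* Advance to the next entry.  After the last entry the iterator is
   freed and NULL is returned.  (With pools, as in the proposal, an
   iterator abandoned mid-loop would not leak; here it would.) */
toy_entry_iterator_t *
toy_entry_next (toy_entry_iterator_t *it)
{
  if (++it->index >= it->adm_access->count)
    {
      free (it);
      return NULL;
    }
  return it;
}

/* Const access only: the caller cannot modify the cached data. */
const toy_entry_t *
toy_entry_iterator_entry (const toy_entry_iterator_t *it)
{
  return &it->adm_access->entries[it->index];
}

/* Example caller: highest revision among the entries, or -1 if the
   directory has no entries. */
long
toy_max_revision (const toy_access_t *adm_access)
{
  long max = -1;
  toy_entry_iterator_t *it;

  for (it = toy_entry_first (adm_access); it; it = toy_entry_next (it))
    {
      const toy_entry_t *entry = toy_entry_iterator_entry (it);
      if (entry->revision > max)
        max = entry->revision;
    }
  return max;
}
```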

 The other uses of svn_wc_entries_read tend to extract a single entry.
 I hope these can be converted to use svn_wc_entry.  One slight problem
 is the use of svn_wc_entries_read to intentionally extract a
 directory's entry from its parent.  This is done because that's where
 the "deleted" state is stored.  I think the entry returned by
 svn_wc_entry could contain this state.  Why doesn't it?  I don't know;
 possibly it's an accident, or possibly it's intentional as in the past
 parsing two entries files would have been expensive.

 2.2 Access Baton Interface

 I would also like to modify the access baton interface.  At present
 the open function detects and skips missing directories when opening a
 directory hierarchy.  I would like to record this information in the
 access baton set, and modify the retrieve functions to include an
 svn_boolean_t* parameter that gets set TRUE when a request for a
 missing directory is made.  The advantage of doing this is that the
 application could avoid making svn_io_check_path and svn_wc_check_wc
 calls when the access baton already has the information.  The function
 prop_path_internal looks like a good candidate for this optimisation.
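
 A retrieve function with the extra parameter might behave like the
 toy model below.  The toy_* names are hypothetical, int stands in for
 svn_boolean_t, and a linear search over an array stands in for the
 real set lookup; it is a sketch of the idea, not a proposed
 signature.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
  const char *path;
  int missing;     /* non-zero: directory was absent when the set was opened */
} toy_set_member_t;

typedef struct {
  const toy_set_member_t *members;
  size_t count;
} toy_member_set_t;

/* Look up PATH in SET.  Returns the member for PATH, or NULL if there
   is none.  *MISSING is set to 1 only when PATH is known to be a
   missing directory, so the caller can skip its own disk checks; it
   is set to 0 in every other case. */
const toy_set_member_t *
toy_adm_retrieve (const toy_member_set_t *set, const char *path,
                  int *missing)
{
  size_t i;

  assert (set != NULL && path != NULL && missing != NULL);
  *missing = 0;
  for (i = 0; i < set->count; i++)
    if (strcmp (set->members[i].path, path) == 0)
      {
        if (set->members[i].missing)
          {
            *missing = 1;
            return NULL;
          }
        return &set->members[i];
      }
  return NULL;
}
```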


 3. Access Baton Sets
    -----------------

 Each access baton represents a directory.  Access batons can associate
 together in sets.  Given an access baton in a set, it is possible to
 retrieve any other access baton in the set.  When an access baton in a
 set is closed, all other access batons in the set that represent
 subdirectories are also closed.  The set is implemented as a hash
 table "owned" by one baton in the set, but shared by all batons in
 the set.
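
 The closing rule can be illustrated with a small model.  This is a
 sketch only: the toy_* names are hypothetical, a plain array stands
 in for the shared hash table, and paths are assumed to use '/' as
 the separator with no trailing slash.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
  const char *path;
  int open;
} toy_set_baton_t;

/* Non-zero if CHILD is PARENT itself or lies somewhere beneath it. */
static int
toy_is_same_or_child (const char *parent, const char *child)
{
  size_t len = strlen (parent);

  return strncmp (parent, child, len) == 0
         && (child[len] == '\0' || child[len] == '/');
}

/* Close the baton for PATH together with every open baton in the set
   that represents a subdirectory of PATH.  Returns the number of
   batons closed. */
size_t
toy_adm_close (toy_set_baton_t *set, size_t count, const char *path)
{
  size_t i, closed = 0;

  assert (set != NULL && path != NULL);
  for (i = 0; i < count; i++)
    if (set[i].open && toy_is_same_or_child (path, set[i].path))
      {
        set[i].open = 0;
        closed++;
      }
  return closed;
}
```

 Note that a sibling such as "wc/ab" is not treated as a child of
 "wc/a", because the match must end at a path separator.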

 At present in the code, access batons are opened in a parent->child
 order.  This works well with the shared hash being owned by the first
 baton in each set.  There is code to detect if closing a baton will
 destroy the hash while other batons are using it; as far as I know it
 doesn't currently trigger.  If it turns out that this needs to be
 supported it should be possible to transfer the hash information to
 another baton.

 The function that opens access batons takes a flag that indicates
 whether to open a single directory, or the complete tree.  When a
 single directory is opened only partial entry information is available
 for sub-directories, since getting full information requires reading
 the sub-directory's entries file.  An enhancement would be to replace
 the boolean with a tristate, to open just the one directory, the whole
 tree, or the one directory and its immediate sub-directories.  This
 may be useful for the non-recursive operations, some of which have to
 open the entire tree because they need full entry information for the
 sub-directories.  The recursive/non-recursive stuff really needs to be
 revisited in most of the client and wc libraries (an issue exists).
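
 The tristate could be as simple as an enum.  The names below are
 hypothetical and the helper only shows how an open routine might
 decide, while walking down from the requested directory, whether to
 open a given sub-directory.

```c
#include <assert.h>

/* Hypothetical replacement for the boolean "tree" flag. */
typedef enum {
  toy_open_this_dir_only,   /* just the one directory */
  toy_open_immediates,      /* the directory and its immediate sub-directories */
  toy_open_whole_tree       /* the complete tree */
} toy_open_depth_t;

/* Return non-zero if a directory LEVELS below the requested one
   (0 means the requested directory itself) should be opened. */
int
toy_should_open (toy_open_depth_t depth, unsigned int levels)
{
  assert (depth >= toy_open_this_dir_only && depth <= toy_open_whole_tree);
  switch (depth)
    {
    case toy_open_this_dir_only:
      return levels == 0;
    case toy_open_immediates:
      return levels <= 1;
    case toy_open_whole_tree:
      return 1;
    }
  return 0;
}
```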


 4. Access Baton Conversion
    -----------------------

 Given a function
   svn_error_t *foo (const char *path);
 if PATH is always a directory then the change that gets made is usually
   svn_error_t *foo (svn_wc_adm_access_t *adm_access);
 Within foo, the original const char* can be obtained using
   const char *svn_wc_adm_access_path(svn_wc_adm_access_t *adm_access);

 The above case sometimes occurs as
   svn_error_t *foo (const char *name, const char *dir);
 where NAME is a single path component, and DIR is a directory. Conversion
 is again simple in this case
   svn_error_t *foo (const char *name, svn_wc_adm_access_t *adm_access);

 The more difficult case is
   svn_error_t *foo (const char *path);
 where PATH can be a file or a directory.  This occurs a lot in the
 current code. In the long term these may get converted to
   svn_error_t *foo (const char *name, svn_wc_adm_access_t *adm_access);
 where NAME is a single path component.  However this involves more
 changes to the code calling foo than are strictly necessary, so
 initially they get converted to
   svn_error_t *foo (const char *path, svn_wc_adm_access_t *adm_access);
 where PATH is passed unchanged and an additional access baton is
 passed.  This interface is less than ideal, since there is duplicate
 information in the path and baton, but since it involves fewer changes
 in the calling code it makes a reasonable intermediate step.
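
 The duplication in the intermediate form can be made concrete: given
 both PATH and a baton, the NAME of the long-term form can be derived
 (or the pair rejected as inconsistent).  The sketch below is purely
 illustrative.  The toy_* names are hypothetical, the baton is reduced
 to its directory path, and paths are assumed to use '/' with no
 trailing slash.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for an access baton: only the directory path is kept. */
typedef struct {
  const char *path;
} toy_dir_baton_t;

const char *
toy_adm_access_path (const toy_dir_baton_t *adm_access)
{
  return adm_access->path;
}

/* If PATH names an entry directly inside the baton's directory,
   return a pointer to the NAME component within PATH.  If PATH is the
   directory itself, return the empty string.  Otherwise PATH and the
   baton disagree and NULL is returned. */
const char *
toy_name_for_baton (const char *path, const toy_dir_baton_t *adm_access)
{
  const char *dir;
  size_t len;

  assert (path != NULL && adm_access != NULL);
  dir = toy_adm_access_path (adm_access);
  len = strlen (dir);

  if (strncmp (path, dir, len) != 0)
    return NULL;
  if (path[len] == '\0')
    return path + len;                  /* PATH is the directory itself */
  if (path[len] != '/' || path[len + 1] == '\0'
      || strchr (path + len + 1, '/') != NULL)
    return NULL;                        /* not an immediate child */
  return path + len + 1;
}
```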


 5. Logging
    -------

 As well as caching, the other problem that needs to be addressed is
 the issue of logging.  Modifications to the working copy are supposed
 to use the log file mechanism to ensure that multiple changes that
 need to be atomic cannot be partially completed.  If the individual
 changes that may need to be logged are all forced to use an access
 baton, then the access baton may be able to identify when the log
 file mechanism should be used.  Combine this with an access baton
 state that tracks whether a log file is being run and we may be able
 to automatically identify those places that are failing to use the
 log file mechanism.


 6. Status
    ------

 Entries caching has been implemented.  Write performance has been
 addressed, but could be improved further (section 1).

 The interface changes (section 2) have not been started.

 The access baton conversion is complete in so far as passing batons is
 concerned.  The path->name signature changes (section 4) have not been
 made.

 Automatic detection of failure to use a log file (section 5) has not
 been started.