What's happening now
The filesystem needs some path validation code independent of the
SVN path utilities. A filesystem path is a well-defined thing that
should be held a safe distance away from future changes to SVN's
general path library.
Incorrectnesses
We must ensure that node numbers are never reused. If we open a node,
svn_fs_delete it, and then create new nodes, what happens when the
original node structure suddenly comes to refer to an entirely
different node? Files become directories?
We should convert filenames to some canonical Unicode form, for
comparison.
Does everyone who should call svn_fs__check_fs actually do so?
svn_fs_delete will actually delete non-empty directories, if they're
not cloned. This is inconsistent; should it be fixed?
Does every operation on a deleted node or completed transaction fail
gracefully?
Produce helpful error messages when filename paths contain null
characters.
Uglinesses
Fix up comments in svn_fs.h for transactions.
Add `public name' member to filesystem structure, to use to identify
the filesystem in error messages. When driven by DAV, this could be a
URL.
When a dag function signals an error, it has no idea what the path of
the relevant node was, so it can only report node revision IDs, which
are pretty useless to the user. tree.c should probably rewrap some
errors to include the path.
svn_fs__getsize shouldn't rely on a maximum value for detecting
overflow.
The use of svn_fs__getsize in svn_fs__parse_id is ugly --- what if
svn_vernum_t and apr_size_t aren't the same size?
Consider some macros or accessor functions for referencing the pieces
of the NODE-REVISION skel, instead of seeing stuff like
node->children->next->next and other unreadable rubbish.
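
A minimal sketch of what such accessors might look like (the macro
names and the assumed HEADER/PROPLIST/CONTENTS layout of the
NODE-REVISION skel are illustrative, not the actual definitions):

    /* Hypothetical accessors for the pieces of a NODE-REVISION skel.
       Assumes the usual skel `children'/`next' links; the layout
       (header, property list, contents) is an assumption made for
       illustration only.  */
    #define SVN_FS__NR_HEADER(node)   ((node)->children)
    #define SVN_FS__NR_PROPLIST(node) ((node)->children->next)
    #define SVN_FS__NR_CONTENTS(node) ((node)->children->next->next)

Callers would then write SVN_FS__NR_PROPLIST (node) instead of
spelling out the pointer chain.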
Slownesses
We don't store older node revisions as deltas yet.
The delta algorithm walks the whole tree using a single pool, so the
memory used is proportional to the size of the target tree. Instead,
it should use a separate subpool every time it recurses into a new
directory, and free that subpool as soon as it's done processing that
subdirectory, so the memory used is proportional to the depth of the
tree.
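
A rough sketch of the intended pool discipline (walk_directory,
dir_baton_t, and the entry-iteration helpers are hypothetical
stand-ins for the real delta walker; only the svn_pool_create and
svn_pool_destroy calls are the actual API):

    /* Sketch: give each recursion into a subdirectory its own subpool,
       and destroy it as soon as that subtree has been processed, so
       memory use tracks the depth of the tree rather than its size.  */
    static svn_error_t *
    walk_directory (dir_baton_t *dir, apr_pool_t *pool)
    {
      entry_t *e;

      for (e = first_entry (dir, pool); e; e = next_entry (e))
        {
          if (is_directory (e))
            {
              apr_pool_t *subpool = svn_pool_create (pool);
              SVN_ERR (walk_directory (open_dir (e, subpool), subpool));
              svn_pool_destroy (subpool);
            }
          else
            SVN_ERR (process_file (e, pool));
        }

      return SVN_NO_ERROR;
    }

On an early error return the subpool is not destroyed explicitly, but
it is a child of POOL and goes away when the caller cleans up.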
We should move as much real content out of the NODE-REVISION skel as
possible; the skels should be holding only small stuff (node kind,
flags).
- File contents and deltas should be moved out to a `contents' table.
The NODE-REVISION skel should simply contain a key into that table.
- Directory contents should be moved out to a `directories' table,
with a separate table entry for each directory entry. Keys into the
table should be of the form `NODE-ID ENTRY-NAME NODE-REVISION', and
values should be node revision IDs, or the word `deleted'. To look
up an entry named E in a directory whose node revision is N.R,
search for the entry `N E x', where x is the largest number present
<= R. (A short sketch of this lookup follows the list.)
- Property lists should be moved out to a table `properties', indexed
similarly to the above. We could deltify property contents the
same way we do file contents.
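
For illustration, here is the selection rule from the `directories'
table item above, stated as code (the helper name and the in-memory
array are assumptions; a real implementation would walk a Berkeley DB
cursor over the keys instead):

    /* Given the revision numbers x for which a row `N E x' exists,
       return the one that applies to node revision N.R --- the
       largest x <= R --- or -1 if the entry does not exist as of R.  */
    static long
    applicable_entry_rev (const long *present, int count, long rev)
    {
      long best = -1;
      int i;

      for (i = 0; i < count; i++)
        if (present[i] <= rev && present[i] > best)
          best = present[i];

      return best;
    }

For example, if the table holds rows `3 foo 1', `3 foo 4', and
`3 foo 9', then looking up `foo' in node revision 3.7 selects the row
for revision 4; if that row's value is `deleted', the entry does not
exist in 3.7.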
Amenities
Extend svn_fs_copy to handle mutable nodes.
Long term ideas:
- directory entry cache:
Create a cache mapping a node revision ID X plus a filename component
N onto a node revision ID Y, meaning that X is a directory in
which the name N is bound to ID Y. If everything were in the cache,
a path lookup could run with no I/O except for the final node.
Since node revisions never change, we wouldn't have to worry about
invalidating the cache. Mutable node objects will need special
handling, of course. (See the sketch after this list.)
- fulltext cache:
If we've recently computed a node's fulltext, we might want to keep
that around in case we need to compute one of its nearby ancestors'
fulltext, too. This could be a waste, though --- the access
patterns are a mix of linear scan (backwards to reconstruct a given
revision) and random (who knows what node we'll hit next), so it's
not clear what cache policy would be effective. Best to record some
data on how many delta applications a given cache would avoid before
implementing it.
- delta cache:
As people update, we're going to be recomputing text deltas for the
most recently changed files pretty often. It might be worthwhile to
cache the deltas for a little while.
- Handle Unicode canonicalization for directory and property names
ourselves. People should be able to hand us any valid UTF-8
sequence, perhaps with precomposed characters or non-spacing marks
in a non-canonical order, and find the appropriate matches, given
the rules defined by the Unicode standard.
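
A minimal sketch of the directory entry cache idea from the first item
in the list above, using APR hash tables (the dirent_cache_* names and
the string key format are assumptions for illustration; node revision
IDs are treated as opaque strings):

    #include <apr_hash.h>
    #include <apr_strings.h>

    /* Map (parent node revision ID, entry name) -> child node revision
       ID.  Since committed node revisions never change, entries never
       need to be invalidated; mutable nodes would simply bypass the
       cache.  */
    static apr_hash_t *
    dirent_cache_create (apr_pool_t *pool)
    {
      return apr_hash_make (pool);
    }

    static void
    dirent_cache_set (apr_hash_t *cache, apr_pool_t *pool,
                      const char *parent_id, const char *name,
                      const char *child_id)
    {
      const char *key = apr_psprintf (pool, "%s/%s", parent_id, name);
      apr_hash_set (cache, key, APR_HASH_KEY_STRING, child_id);
    }

    static const char *
    dirent_cache_get (apr_hash_t *cache, apr_pool_t *pool,
                      const char *parent_id, const char *name)
    {
      const char *key = apr_psprintf (pool, "%s/%s", parent_id, name);
      return apr_hash_get (cache, key, APR_HASH_KEY_STRING);
    }

Resolving a path would then consult dirent_cache_get for each
component, falling back to the database only on a miss.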
Keeping repositories alive in the long term: Berkeley DB is infamous
for changing its file format from one revision to the next. If someone
saves a Subversion 1.0 repository on a CD somewhere, and then tries to
read it seven years later, their chance of being able to read it with
the latest revision of Subversion is nil. The solution:
- Define a simple XML repository dump format for the complete
repository data. This should be the same format we use for CVS
repository conversion. We'll have an import function for it.
- Write a program that is simple and self-contained --- one that does
not use Berkeley DB or any fancy XML tools, and uses nothing but POSIX
read and seek --- that can dump a Subversion repository in that format.
- For each revision of Subversion, make a sample repository, and
archive a copy of it away as test data.
- Write a test suite that verifies that the repository dump program
can handle all of the archived formats.