The purpose of this branch is to experiment with speeding up 'svn diff',
especially for large files with lots of unchanged lines.
As a secondary objective, this should also speed up 'svn blame', since blame
performs a diff on the client side for every revision that is part of the
blame operation. The speedup will only be noticeable if the server and network
are fast enough that the client becomes the bottleneck (e.g. on a LAN, with a
server that has a fast backend, such as FSFS on SSD).
General approach: reduce the problem set for the LCS algorithm as much as
possible by eliminating the identical prefix and suffix before putting the
tokens (lines) into the token tree (see [1] for some background).
Specific approach for this branch: scan for the identical prefix/suffix
byte by byte, until a difference is found. This is done immediately after
opening the datasources, before getting the tokens (lines) and inserting them
into the token tree.
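The byte-by-byte scan described above can be sketched as below. This is a
minimal stand-in, assuming plain in-memory buffers; the function names
scan_prefix/scan_suffix are hypothetical, and the real branch code works on
chunked datasources and afterwards rewinds the prefix to the last newline so
that only whole lines are stripped.

```c
#include <stddef.h>

/* Length of the identical prefix of two buffers. */
static size_t
scan_prefix(const char *a, size_t a_len, const char *b, size_t b_len)
{
  size_t i = 0;
  size_t limit = a_len < b_len ? a_len : b_len;
  while (i < limit && a[i] == b[i])
    i++;
  return i;
}

/* Length of the identical suffix, stopping before the already
   eliminated prefix so the two ranges never overlap. */
static size_t
scan_suffix(const char *a, size_t a_len, const char *b, size_t b_len,
            size_t prefix)
{
  size_t i = 0;
  size_t limit = (a_len < b_len ? a_len : b_len) - prefix;
  while (i < limit && a[a_len - 1 - i] == b[b_len - 1 - i])
    i++;
  return i;
}
```

Only the bytes between prefix and suffix then need to be tokenized and fed to
the LCS algorithm.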
TODO:
* eliminate identical prefix [DONE]
* eliminate identical suffix [DONE]
* make diff3 and diff4 use datasources_open [DONE]
(this may eliminate the need for datasource_open, and the flag
datasource_opened in token.c#svn_diff__get_tokens)
* implement (part of) increment_pointers and
decrement_pointers with a macro [DONE]
(offers another 10% speedup)
* integrate (some of the) low-level optimizations for prefix/suffix
  scanning, proposed by stefan2 [2] [ ]
* revv svn_diff.h#svn_diff_fns_t [DONE]
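The increment_pointers/decrement_pointers item above can be illustrated with
the sketch below. This is a simplified, hypothetical version assuming plain
in-memory ranges: the real code tracks chunked file buffers and only falls
back to a function call when a chunk boundary has to be crossed, which is
where the ~10% speedup of the macro comes from.

```c
/* Simplified stand-in for the per-datasource position used during
   prefix/suffix scanning. */
struct pos {
  const char *curp;  /* current byte */
  const char *endp;  /* one past the last byte */
};

/* Advance every datasource's pointer by one byte.  Handling the common
   case in a macro avoids a function call per scanned byte. */
#define INCREMENT_POINTERS(ps, n)              \
  do {                                         \
    int i_;                                    \
    for (i_ = 0; i_ < (n); i_++)               \
      if ((ps)[i_].curp < (ps)[i_].endp)       \
        (ps)[i_].curp++;                       \
  } while (0)
```

The do { } while (0) wrapper keeps the macro safe to use as a single
statement, e.g. in an unbraced if branch.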
Other wild ideas (may or may not be worth pursuing):
* Take advantage of the fact that matching lines often come in batches: when
inserting a new line into the token tree, and the previous line had a
match, first try to match the new token with the successor of the
previously matched line, before binary-searching the tree for a match.
* Add some balancing to the token tree in token.c#svn_diff__tree_insert_token
  to make sure it doesn't degenerate. We may want to investigate first
  whether an unbalanced tree is actually a problem in normal use cases.
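The "matching lines come in batches" idea above can be sketched as follows.
This is a hypothetical illustration (find_match and the plain array are
stand-ins for the real token tree): if the previous token matched line i of
the other file, the next token is first compared against line i+1, and the
full search only runs on a miss.

```c
#include <string.h>

/* Find the line of LINES_A matching TOKEN.  *HINT is the index matched
   by the previous token, or -1.  Matching lines tend to come in
   batches, so try the successor of the previous match first; fall back
   to a full (here: linear) search otherwise. */
static int
find_match(const char **lines_a, int n, const char *token, int *hint)
{
  int i;
  if (*hint >= 0 && *hint + 1 < n
      && strcmp(lines_a[*hint + 1], token) == 0)
    return (*hint = *hint + 1);      /* batch continues: O(1) hit */
  for (i = 0; i < n; i++)
    if (strcmp(lines_a[i], token) == 0)
      return (*hint = i);
  *hint = -1;                        /* no match: batch is broken */
  return -1;
}
```

In the real tree the fallback would be the existing binary search; the
successor check only adds one extra comparison in the worst case.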
Rejected ideas:
* Make prefix scanning able to skip one or a few non-matching lines, if
  it can find strictly more matching lines after that, to keep the
  prefix scanning going.
  This is not a good idea, because it can lead to bad cases of
  mis-synchronization (see [3]):
bxxaxxbxx
|| ||
axxbxx
instead of (longest common subsequence):
bxxaxxbxx
//////
axxbxx
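The example above can be checked with the classic dynamic-programming LCS,
sketched below over bytes for brevity (svn_diff works on line tokens, and
does not use this quadratic formulation): the greedy skip aligns only the
four characters marked with '|', while the true LCS matches all six
characters of "axxbxx".

```c
#include <string.h>

/* Dynamic-programming LCS length over bytes; fixed-size table,
   assuming short demo inputs (< 64 bytes). */
static int
lcs_len(const char *a, const char *b)
{
  int la = (int)strlen(a), lb = (int)strlen(b);
  int dp[64][64] = {{0}};
  int i, j;
  for (i = 1; i <= la; i++)
    for (j = 1; j <= lb; j++)
      dp[i][j] = (a[i-1] == b[j-1])
                 ? dp[i-1][j-1] + 1
                 : (dp[i-1][j] > dp[i][j-1] ? dp[i-1][j] : dp[i][j-1]);
  return dp[la][lb];
}
```

Because prefix elimination must never cost LCS matches, it has to stop at the
first differing byte instead of skipping ahead.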
[1] http://en.wikipedia.org/wiki/Longest_common_subsequence_problem#Reduce_the_problem_set
[2] http://svn.haxx.se/dev/archive-2011-01/0005.shtml
[3] Miller, W., and Myers, E.W. "A File Comparison Program.", Software -
Practice & Experience 15 (1985), pp. 1025-1040.