The purpose of this branch is to experiment with speeding up 'svn diff',
especially for large files with lots of unchanged lines.
As a secondary objective, this should also speed up 'svn blame', since blame
performs a diff on the client side for every revision that is part of the
blame operation. The speedup will only be noticeable if the server and network
are fast enough that the client becomes the bottleneck (e.g. on a LAN, with a
server that has a fast backend, such as FSFS on SSD).
General approach: reduce the problem set for the LCS algorithm as much as
possible by eliminating the identical prefix and suffix before putting the
tokens (lines) into the token tree (see [1] for some background).
Specific approach for this branch: scan for the identical prefix/suffix
byte by byte, until a difference is found. This is done immediately after
opening the datasources, before getting the tokens (lines) and inserting them
into the token tree.
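The byte-by-byte scan described above can be sketched as below. This is a
minimal stand-in, assuming plain in-memory buffers; the function names
scan_prefix/scan_suffix are hypothetical, and the real branch code works on
chunked datasources and afterwards rewinds the prefix to the last newline so
that only whole lines are stripped.

```c
#include <stddef.h>

/* Length of the identical prefix of two buffers. */
static size_t
scan_prefix(const char *a, size_t a_len, const char *b, size_t b_len)
{
  size_t i = 0;
  size_t limit = a_len < b_len ? a_len : b_len;
  while (i < limit && a[i] == b[i])
    i++;
  return i;
}

/* Length of the identical suffix, stopping before the already
   eliminated prefix so the two ranges never overlap. */
static size_t
scan_suffix(const char *a, size_t a_len, const char *b, size_t b_len,
            size_t prefix)
{
  size_t i = 0;
  size_t limit = (a_len < b_len ? a_len : b_len) - prefix;
  while (i < limit && a[a_len - 1 - i] == b[b_len - 1 - i])
    i++;
  return i;
}
```

Only the bytes between prefix and suffix then need to be tokenized and fed to
the LCS algorithm.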
TODO:
* eliminate identical prefix [DONE]
* eliminate identical suffix [DONE]
* make diff3 and diff4 use datasources_open [DONE]
(this may eliminate the need for datasource_open, and the flag
datasource_opened in token.c#svn_diff__get_tokens)
* implement (part of) increment_pointers and
decrement_pointers with a macro [DONE]
(offers another 10% speedup)
* integrate (some of the) low-level optimizations for prefix/suffix
  scanning, proposed by stefan2 [2] [ ]
* revv svn_diff.h#svn_diff_fns_t [DONE]
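The increment_pointers/decrement_pointers item above can be illustrated with
the sketch below. This is a simplified, hypothetical version assuming plain
in-memory ranges: the real code tracks chunked file buffers and only falls
back to a function call when a chunk boundary has to be crossed, which is
where the ~10% speedup of the macro comes from.

```c
/* Simplified stand-in for the per-datasource position used during
   prefix/suffix scanning. */
struct pos {
  const char *curp;  /* current byte */
  const char *endp;  /* one past the last byte */
};

/* Advance every datasource's pointer by one byte.  Handling the common
   case in a macro avoids a function call per scanned byte. */
#define INCREMENT_POINTERS(ps, n)              \
  do {                                         \
    int i_;                                    \
    for (i_ = 0; i_ < (n); i_++)               \
      if ((ps)[i_].curp < (ps)[i_].endp)       \
        (ps)[i_].curp++;                       \
  } while (0)
```

The do { } while (0) wrapper keeps the macro safe to use as a single
statement, e.g. in an unbraced if branch.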
Other wild ideas (may or may not be worth pursuing):
* Take advantage of the fact that matching lines often come in batches: when
inserting a new line into the token tree, and the previous line had a
match, first try to match the new token with the successor of the
previously matched line, before binary-searching the tree for a match.
* Add some balancing to the token tree in token.c#svn_diff__tree_insert_token
  to make sure it doesn't degenerate. We may want to investigate first
  whether an unbalanced tree is actually a problem in normal use cases.
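The "matching lines come in batches" idea above can be sketched as follows.
This is a hypothetical illustration (find_match and the plain array are
stand-ins for the real token tree): if the previous token matched line i of
the other file, the next token is first compared against line i+1, and the
full search only runs on a miss.

```c
#include <string.h>

/* Find the line of LINES_A matching TOKEN.  *HINT is the index matched
   by the previous token, or -1.  Matching lines tend to come in
   batches, so try the successor of the previous match first; fall back
   to a full (here: linear) search otherwise. */
static int
find_match(const char **lines_a, int n, const char *token, int *hint)
{
  int i;
  if (*hint >= 0 && *hint + 1 < n
      && strcmp(lines_a[*hint + 1], token) == 0)
    return (*hint = *hint + 1);      /* batch continues: O(1) hit */
  for (i = 0; i < n; i++)
    if (strcmp(lines_a[i], token) == 0)
      return (*hint = i);
  *hint = -1;                        /* no match: batch is broken */
  return -1;
}
```

In the real tree the fallback would be the existing binary search; the
successor check only adds one extra comparison in the worst case.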
Rejected ideas:
* Make prefix scanning able to skip one or a few non-matching lines, if
  it can find strictly more matching lines after that, to keep the
  prefix scanning going.
  This is not a good idea, because it can lead to bad cases of
  mis-synchronization (see [3]):
bxxaxxbxx
|| ||
axxbxx
instead of (longest common subsequence):
bxxaxxbxx
//////
axxbxx
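The example above can be checked with the classic dynamic-programming LCS,
sketched below over bytes for brevity (svn_diff works on line tokens, and
does not use this quadratic formulation): the greedy skip aligns only the
four characters marked with '|', while the true LCS matches all six
characters of "axxbxx".

```c
#include <string.h>

/* Dynamic-programming LCS length over bytes; fixed-size table,
   assuming short demo inputs (< 64 bytes). */
static int
lcs_len(const char *a, const char *b)
{
  int la = (int)strlen(a), lb = (int)strlen(b);
  int dp[64][64] = {{0}};
  int i, j;
  for (i = 1; i <= la; i++)
    for (j = 1; j <= lb; j++)
      dp[i][j] = (a[i-1] == b[j-1])
                 ? dp[i-1][j-1] + 1
                 : (dp[i-1][j] > dp[i][j-1] ? dp[i-1][j] : dp[i][j-1]);
  return dp[la][lb];
}
```

Because prefix elimination must never cost LCS matches, it has to stop at the
first differing byte instead of skipping ahead.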
[1] http://en.wikipedia.org/wiki/Longest_common_subsequence_problem#Reduce_the_problem_set
[2] http://svn.haxx.se/dev/archive-2011-01/0005.shtml
[3] Miller, W., and Myers, E.W. "A File Comparison Program.", Software -
Practice & Experience 15 (1985), pp. 1025-1040.