On the diff-optimizations-bytes branch: Remove BRANCH-README, in preparation for reintegration to trunk. git-svn-id: https://svn.apache.org/repos/asf/subversion/branches/diff-optimizations-bytes@1067789 13f79535-47bb-0310-9956-ffa450edef68
diff --git a/BRANCH-README b/BRANCH-README deleted file mode 100644 index 0a784d8..0000000 --- a/BRANCH-README +++ /dev/null
@@ -1,64 +0,0 @@ -The purpose of this branch is to experiment with speeding up 'svn diff', -especially for large files with lots of unchanged lines. - -As a secondary objective, this should also speed up 'svn blame', since blame -performs a diff on the client side for every revision part of the blame -operation. This will only be noticeable if the server and network are fast -enough, so the client becomes the bottleneck (e.g. on a LAN, server having a -fast backend (e.g. FSFS on SSD)). - -General approach: reduce the problem set for the LCS algorithm as much as -possible, by eliminating identical prefix and suffix before putting the -tokens (lines) into the token tree (see [1] for some background). - -Specific approach for this branch: scan for identical prefix/suffix -byte-per-byte, until a difference is found. This is done immediately after -opening the datasources, before getting the tokens (lines) and inserting them -into the token tree. - - -TODO: - - * eliminate identical prefix [DONE] - * eliminate identical suffix [DONE] - * make diff3 and diff4 use datasources_open [DONE] - (this may eliminate the need for datasource_open, and the flag - datasource_opened in token.c#svn_diff__get_tokens) - * implement (part of) increment_pointers and - decrement_pointers with a macro [DONE] - (offers another 10% speedup) - * integrate (some of the) low-level optimizations for prefix/suffix - scanning, proposed by stefan2 [2] [] - * revv svn_diff.h#svn_diff_fns_t [DONE] - -Other wild ideas (may or may not be worth pursuing): - * Take advantage of the fact that matching lines often come in batches: when - inserting a new line into the token tree, and the previous line had a - match, first try to match the new token with the successor of the - previously matched line, before binary-searching the tree for a match. - * Add some balancing to the token tree in token.c#svn_diff__tree_insert_token - to make sure it doesn't degenerate. We may want to investigate first - whether non-balancedness is really a problem in normal use cases. - -Rejected ideas: - * Make prefix scanning able to skip 1 or a couple of non-matching lines, if - it is able to find strictly more matching lines after that, to keep the - prefix scanning going. - - This is not a good idea, because it can lead to bad cases of - missynchronization (see [3]): - - bxxaxxbxx - || || - axxbxx - - instead of (longest common subsequence): - - bxxaxxbxx - ////// - axxbxx - -[1] http://en.wikipedia.org/wiki/Longest_common_subsequence_problem#Reduce_the_problem_set -[2] http://svn.haxx.se/dev/archive-2011-01/0005.shtml -[3] Miller, W., and Myers, E.W. "A File Comparison Program.", Software - - Practice & Experience 15 (1985), pp. 1025-1040.