The purpose of this branch is to experiment with speeding up 'svn diff',
especially for large files with lots of unchanged lines.

As a secondary objective, this should also speed up 'svn blame', since blame
performs a diff on the client side for every revision that is part of the
blame operation.  This will only be noticeable if the server and network are
fast enough that the client becomes the bottleneck (e.g. on a LAN, with a
server that has a fast backend such as FSFS on an SSD).

General approach: reduce the problem set for the LCS algorithm as much as
possible, by eliminating identical prefix and suffix before putting the
tokens (lines) into the token tree (see [1] for some background).

Specific approach for this branch: scan for the identical prefix/suffix
byte by byte, until a difference is found.  This is done immediately after
opening the datasources, before getting the tokens (lines) and inserting
them into the token tree.
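
As an illustration, the core of such a prefix scan could look like the
minimal sketch below (hypothetical names, not the branch's actual code; the
real implementation in libsvn_diff/diff_file.c works on chunked buffers that
are refilled from disk, and scans the identical suffix the same way,
backwards from the end):

   #include <stddef.h>

   /* Return the length of the identical prefix of BUF1 and BUF2, rounded
      back to the last line boundary, for two files that are completely in
      memory. */
   static size_t
   find_identical_prefix(const char *buf1, size_t len1,
                         const char *buf2, size_t len2)
   {
     size_t i = 0;
     size_t prefix = 0;    /* prefix length up to the last '\n' seen */

     while (i < len1 && i < len2 && buf1[i] == buf2[i])
       {
         if (buf1[i] == '\n')
           prefix = i + 1;
         i++;
       }

     /* Only whole lines may be stripped; a line that matches partially
        still has to be handed to the tokenizer. */
     return prefix;
   }

Everything inside the identical prefix (and suffix) then never has to be
hashed, inserted into the token tree, or considered by the LCS algorithm;
only the lines in between go through that machinery.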


TODO:

* eliminate identical prefix [DONE]
* eliminate identical suffix [DONE]
* make diff3 and diff4 use datasources_open [DONE]
  (this may eliminate the need for datasource_open, and the flag
  datasource_opened in token.c#svn_diff__get_tokens)
* implement (part of) increment_pointers and decrement_pointers with a
  macro [DONE]
  (offers another 10% speedup; see the sketch after this list)
* integrate (some of the) low-level optimizations for prefix/suffix
  scanning, proposed by stefan2 [2] []
* revv svn_diff.h#svn_diff_fns_t [DONE]
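
The macro idea from the increment_pointers/decrement_pointers item can be
pictured roughly as follows (a sketch with hypothetical names and fields;
the branch keeps its real chunk-handling state in libsvn_diff/diff_file.c).
The gain comes from keeping the common case (the pointer stays inside the
current chunk) free of a function call:

   #include "svn_error.h"

   struct file_info {
     char *curp;    /* current byte in the chunk buffer */
     char *endp;    /* one past the last valid byte in the buffer */
     /* ... file handle, chunk number, eof flag, ... */
   };

   /* Hypothetical helper: read the next chunk into the buffer and reset
      curp/endp accordingly. */
   static svn_error_t *read_next_chunk(struct file_info *file);

   /* Advance FILE by one byte; fall back to a function call only when the
      current chunk is exhausted. */
   #define INCREMENT_POINTER(file, err)                     \
     do {                                                   \
       if ((file)->curp + 1 < (file)->endp)                 \
         (file)->curp++;                  /* fast path */   \
       else                                                 \
         (err) = read_next_chunk(file);   /* slow path */   \
     } while (0)

decrement_pointers works analogously, stepping backwards for the suffix scan.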

Other wild ideas (may or may not be worth pursuing):
* Take advantage of the fact that matching lines often come in batches: when
  inserting a new line into the token tree, and the previous line had a
  match, first try to match the new token with the successor of the
  previously matched line, before binary-searching the tree for a match
  (see the sketch after this list).
* Add some balancing to the token tree in token.c#svn_diff__tree_insert_token
  to make sure it doesn't degenerate.  We may want to investigate first
  whether lack of balance is really a problem in normal use cases.
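
The successor-matching idea (the first item) could look roughly like the
sketch below (hypothetical types and helpers; the real token tree lives in
libsvn_diff/token.c, and node_for_line[] is assumed to map each line of the
original file to the tree node holding that line's token):

   #include <stddef.h>
   #include <stdint.h>
   #include <string.h>

   typedef struct token_node_t {
     uint32_t hash;           /* hash of the line's contents */
     const char *token;       /* the line itself */
     size_t length;
     struct token_node_t *left, *right;
   } token_node_t;

   /* Hypothetical fallback: the usual binary search of the token tree. */
   static token_node_t *tree_lookup(token_node_t *root, uint32_t hash,
                                    const char *token, size_t length);

   static token_node_t *
   match_token(token_node_t *root,
               token_node_t **node_for_line,  /* tree nodes of the original
                                                 file, indexed by line */
               size_t n_lines,
               size_t prev_match,             /* line matched by the previous
                                                 token, or n_lines if none */
               uint32_t hash, const char *token, size_t length)
   {
     /* Matching lines tend to come in runs, so first try the line right
        after the one the previous token matched. */
     if (prev_match + 1 < n_lines)
       {
         token_node_t *cand = node_for_line[prev_match + 1];
         if (cand->hash == hash && cand->length == length
             && memcmp(cand->token, token, length) == 0)
           return cand;
       }

     /* No run in progress: fall back to the normal binary search. */
     return tree_lookup(root, hash, token, length);
   }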

Rejected ideas:
* Make prefix scanning able to skip one or a couple of non-matching lines,
  if it can find strictly more matching lines after that, so that the
  prefix scan can keep going.

  This is not a good idea, because it can lead to bad cases of
  missynchronization (see [3]):

      bxxaxxbxx
       || ||
      axxbxx

  instead of (longest common subsequence):

      bxxaxxbxx
         //////
      axxbxx

[1] http://en.wikipedia.org/wiki/Longest_common_subsequence_problem#Reduce_the_problem_set
[2] http://svn.haxx.se/dev/archive-2011-01/0005.shtml
[3] Miller, W. and Myers, E. W., "A File Comparison Program",
    Software: Practice & Experience 15 (1985), pp. 1025-1040.