| Introduction |
| ------------ |
| |
| This file documents some potential optimizations for "svn diff" (or more |
| specifically for the diff algorithm in libsvn_diff, which is used in the |
| "diff", "blame", "merge" and "update" subcommands on the client side). |
| |
| There are two broad approaches: |
| |
| - Speed up the existing algorithm, while maintaining a "minimal diff" |
| (i.e. without heuristics). |
| |
| - Introduce a "non-minimal diff" mode, either by adding heuristics in the |
| existing algorithm, or by implementing another algorithm which doesn't |
| guarantee that it will produce the "minimal diff" in all cases. |
| |
| |
| I. Speeding up "minimal" diff (no heuristics) |
| --------------------------------------------- |
| |
| 1) More low-level optimization of prefix/suffix scanning. |
| |
| - Further optimization of the N-ary prefix/suffix scanning. |
| - Add special-case code for N==2 (i.e. "svn diff"). |
| |
| 2) Optimize line hashing. |
| |
| - Merge hash calculation with EOL scanning. |
| - Reduce function calling overhead, including loop setup & finalization. |
| |
| 3) Reduce overhead of line/hash container. |
| |
| - Use a low collision rate / low overhead hash container. |
| |
| 4) Avoid some hashing by exploiting the fact that matching lines often come |
| in series. |
| |
| - If the previous line had a match with the other file, first try to |
| directly compare (memcmp) the next line with the successor of the |
| matched line. Only if it doesn't match, calculate the hash to insert |
| it into the container. |
| - This approach probably conflicts with the "Merge hash calculation with |
| EOL scanning" suggestion. |
| |
| |
| II. Going for a non-minimal diff (i.e. heuristics) |
| -------------------------------------------------- |
| |
| In some cases, heuristics can make a big difference (while not guaranteeing |
| that you'll get a minimal diff). |
| See also issue #1966 (libsvn_diff needs 'non-minimal-diff' mode) [1]. |
| |
| 1) Make prefix/suffix scanning able to skip 1 or a couple of |
| non-matching lines, if it is able to find strictly more matching lines |
| after that, to keep the prefix/suffix scanning going. |
| |
| This will usually work well, but can sometimes lead to missynchronization |
| (see [2]): |
| |
| bxxaxxbxx |
| || || |
| axxbxx |
| |
| instead of (longest common subsequence): |
| |
| bxxaxxbxx |
| ////// |
| axxbxx |
| |
| 2) Add some heuristics-based shortcuts in the LCS algorithm. |
| |
| 3) Implement another diff algorithm, such as "Patience Diff" [3], which is |
| already implemented in several other (D)VCS's. It has the potential to |
| be much faster (reducing the problem to calculating several, much |
| smaller LCS's), and has the added advantage of often producing "nicer" |
| diff output. It is however slightly "heuristical", it doesn't guarantee |
| minimality of the diff. |
| |
| |
| References |
| ---------- |
| |
| [1] https://issues.apache.org/jira/browse/SVN-1966 (libsvn_diff |
| needs 'non-minimal-diff' mode) |
| [2] Miller, W., and Myers, E.W. "A File Comparison Program.", Software - |
| Practice & Experience 15 (1985), pp. 1025-1040. |
| [3] http://bramcohen.livejournal.com/73318.html |