blob: 52919cf7c8138f4112cf8ff20a22206d1cf74788 [file] [log] [blame]
Introduction
------------
This file documents some potential optimizations for "svn diff" (or more
specifically for the diff algorithm in libsvn_diff, which is used in the
"diff", "blame", "merge" and "update" subcommands on the client side).
There are two broad approaches:
- Speed up the existing algorithm, while maintaining a "minimal diff"
(i.e. without heuristics).
- Introduce a "non-minimal diff" mode, either by adding heuristics in the
existing algorithm, or by implementing another algorithm which doesn't
guarantee that it will produce the "minimal diff" in all cases.
I. Speeding up "minimal" diff (no heuristics)
---------------------------------------------
1) More low-level optimization of prefix/suffix scanning.
- Further optimization of the N-ary prefix/suffix scanning.
- Add special-case code for N==2 (i.e. "svn diff").
2) Optimize line hashing.
- Merge hash calculation with EOL scanning.
- Reduce function calling overhead, including loop setup & finalization.
3) Reduce overhead of line/hash container.
- Use a low collision rate / low overhead hash container.
4) Avoid some hashing by exploiting the fact that matching lines often come
in series.
- If the previous line had a match with the other file, first try to
directly compare (memcmp) the next line with the successor of the
matched line. Only if it doesn't match, calculate the hash to insert
it into the container.
- This approach probably conflicts with the "Merge hash calculation with
EOL scanning" suggestion.
II. Going for a non-minimal diff (i.e. heuristics)
--------------------------------------------------
In some cases, heuristics can make a big difference (while not guaranteeing
that you'll get a minimal diff).
See also issue #1966 (libsvn_diff needs 'non-minimal-diff' mode) [1].
1) Make prefix/suffix scanning able to skip 1 or a couple of
non-matching lines, if it is able to find strictly more matching lines
after that, to keep the prefix/suffix scanning going.
This will usually work well, but can sometimes lead to missynchronization
(see [2]):
bxxaxxbxx
|| ||
axxbxx
instead of (longest common subsequence):
bxxaxxbxx
//////
axxbxx
2) Add some heuristics-based shortcuts in the LCS algorithm.
3) Implement another diff algorithm, such as "Patience Diff" [3], which is
already implemented in several other (D)VCS's. It has the potential to
be much faster (reducing the problem to calculating several, much
smaller LCS's), and has the added advantage of often producing "nicer"
diff output. It is however slightly "heuristical", it doesn't guarantee
minimality of the diff.
References
----------
[1] https://issues.apache.org/jira/browse/SVN-1966 (libsvn_diff
needs 'non-minimal-diff' mode)
[2] Miller, W., and Myers, E.W. "A File Comparison Program.", Software -
Practice & Experience 15 (1985), pp. 1025-1040.
[3] http://bramcohen.livejournal.com/73318.html