so, now that i've made a few changes, i think you could do
this by keeping a running total of the hunk sizes and then
comparing it to the number of lines in either the original
or the revision. however, i'm not really sure what would be
an appropriate heuristic. perhaps showing both the fraction of
the original that was deleted, $total_deletions / $original_lines,
and the fraction of the revision that was added,
$total_additions / $revision_lines.
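
a rough sketch of the running-total idea, written in Python with the standard difflib module as a stand-in for whatever diff routine the script actually uses. the function name change_ratios and the sample data are made up for illustration, and counting a "replace" hunk as both a deletion and an addition is an assumption:

```python
import difflib

def change_ratios(original_lines, revision_lines):
    """Return (deleted fraction of original, added fraction of revision)."""
    matcher = difflib.SequenceMatcher(
        None, original_lines, revision_lines, autojunk=False
    )
    total_deletions = 0
    total_additions = 0
    # Keep a running total of hunk sizes on each side.
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            total_deletions += i2 - i1  # lines removed from the original
        if tag in ("insert", "replace"):
            total_additions += j2 - j1  # lines added in the revision
    # Guard against empty inputs to avoid dividing by zero.
    deleted = total_deletions / len(original_lines) if original_lines else 0.0
    added = total_additions / len(revision_lines) if revision_lines else 0.0
    return deleted, added

# Hypothetical sample data: one line dropped, two lines appended.
original = ["a", "b", "c", "d"]
revision = ["a", "c", "d", "e", "f"]
deleted_ratio, added_ratio = change_ratios(original, revision)
print(f"deleted {deleted_ratio:.0%} of original, added {added_ratio:.0%} of revision")
```

showing the two numbers side by side keeps a revision that mostly deletes apart from one that mostly adds, which a single combined figure would hide.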
you could also use LCS instead of diff and compare the
size of the LCS (Longest Common Subsequence) to the size
of the original or revised token list. this would allow
you to say roughly "Revision is 80% similar to original"
if @LCS / @original == 0.8.
i have updated my old post to include this heuristic.
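
for anyone who wants to try the LCS ratio without the rest of the script, here is a minimal sketch in Python (a stand-in language, not the code from the old post). it computes only the LCS length with a plain dynamic-programming table; the names lcs_length and similarity and the sample sentences are hypothetical, and the handling of an empty original is an assumption:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of a
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def similarity(original, revised):
    """Size of the LCS relative to the size of the original token list."""
    if not original:
        # Assumption: two empty lists count as identical.
        return 1.0 if not revised else 0.0
    return lcs_length(original, revised) / len(original)

# Hypothetical sample: one word swapped, one word appended.
original = "the quick brown fox jumps".split()
revised = "the quick red fox jumps high".split()
score = similarity(original, revised)
print(f"Revision is {score:.0%} similar to original")
```

dividing by the revised list instead answers a slightly different question (how much of the revision comes from the original), so it may be worth reporting both.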
But how about a bit of luxury in your script as well? (I mean, there is always room for some.)
I'm thinking of identifying something like: lines 20 to 26 in document A are not in document B, even though the rest is the same. This sometimes drives me crazy when comparing not HTML but other .txt files.
But it's just an idea.
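
To make the idea concrete, here is a rough sketch in Python using the standard difflib module (a stand-in, not the script discussed above). The function name missing_ranges and the sample documents are invented for illustration, and treating a changed block as "not in B" is an assumption:

```python
import difflib

def missing_ranges(lines_a, lines_b):
    """Return 1-based inclusive (first, last) line ranges of A absent from B."""
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    ranges = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "delete" means the lines exist only in A; "replace" means they
        # were changed, which this sketch also reports as missing from B.
        if tag in ("delete", "replace"):
            ranges.append((i1 + 1, i2))
    return ranges

# Hypothetical documents: B is A with lines 20 to 26 removed.
doc_a = [f"line {n}" for n in range(1, 31)]
doc_b = doc_a[:19] + doc_a[26:]

for first, last in missing_ranges(doc_a, doc_b):
    print(f"lines {first} to {last} in document A are not in document B")
```

Since it works on plain lists of lines, the same approach should apply to any text file, not only HTML.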