PerlMonks  

RE: RE: RE: Re: HTML Document Comparison

by mdillon (Priest)
on Sep 13, 2000 at 20:49 UTC (#32296)


in reply to RE: RE: Re: HTML Document Comparison
in thread HTML Document Comparison

so, now that i've made a few changes, i think you could do this by keeping a running total of the hunk sizes and then comparing it to the number of lines in either the original or the revision. however, i'm not really sure what an appropriate heuristic would be. perhaps you could show both ratios: $total_deletions / $original_lines and $total_additions / $revision_lines.
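A minimal sketch of the running-total idea, not the code from the earlier post. It assumes the hunks come as a list of lists of [sign, position, line] entries, which is the shape Algorithm::Diff's diff() returns. The helper name change_ratios and the sample hunks are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: tally added and deleted lines across diff hunks.
# Each hunk is a list of [sign, position, line] entries, the shape that
# Algorithm::Diff's diff() returns; sign is '+' or '-'.
sub change_ratios {
    my ($hunks, $original_lines, $revision_lines) = @_;
    my ($total_deletions, $total_additions) = (0, 0);
    for my $hunk (@$hunks) {
        for my $change (@$hunk) {
            if ($change->[0] eq '-') { $total_deletions++ }
            else                     { $total_additions++ }
        }
    }
    # Guard against empty documents to avoid division by zero.
    my $deleted = $original_lines ? $total_deletions / $original_lines : 0;
    my $added   = $revision_lines ? $total_additions / $revision_lines : 0;
    return ($deleted, $added);
}

# Made-up sample: the original has 4 lines, the revision has 5.
my @hunks = (
    [ [ '-', 1, 'old title' ], [ '+', 1, 'new title' ] ],
    [ [ '+', 4, 'extra footer' ] ],
);
my ($deleted, $added) = change_ratios(\@hunks, 4, 5);
printf "deleted %.0f%% of original, added %.0f%% of revision\n",
    100 * $deleted, 100 * $added;
```

Reporting the two ratios separately keeps a revision that only appends text distinguishable from one that rewrites existing lines.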

you could also use LCS (Longest Common Subsequence) instead of diff and compare the size of the LCS to the size of the original or revised token list. this would allow you to say roughly "Revision is 80% similar to original" if @LCS / @original == 0.8.
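A rough sketch of the LCS ratio, under the assumption that both documents are already split into token lists. To stay self-contained it computes the LCS length with a plain dynamic-programming table; in practice a module such as Algorithm::Diff, which exports an LCS function, would likely be used instead. The names lcs_length and similarity are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: length of the longest common subsequence of two
# token lists, via the standard dynamic-programming recurrence,
# keeping only the previous and current rows of the table.
sub lcs_length {
    my ($x, $y) = @_;
    my @prev = (0) x (@$y + 1);
    for my $i (1 .. @$x) {
        my @curr = (0);
        for my $j (1 .. @$y) {
            $curr[$j] = $x->[$i - 1] eq $y->[$j - 1]
                ? $prev[$j - 1] + 1
                : ($prev[$j] > $curr[$j - 1] ? $prev[$j] : $curr[$j - 1]);
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Hypothetical helper: LCS size relative to the original token list.
sub similarity {
    my ($original, $revision) = @_;
    return 0 unless @$original;    # avoid division by zero
    return lcs_length($original, $revision) / @$original;
}

# Made-up token lists for illustration.
my @original = qw(a b c d e);
my @revision = qw(a b x d e f);
printf "Revision is %.0f%% similar to original\n",
    100 * similarity(\@original, \@revision);
```

Dividing by the size of the revision instead of the original gives a different figure when the two lists differ in length, so it is worth stating which denominator a reported percentage uses.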

i have updated my old post to include this heuristic.

