PerlMonks  

HTML Document Comparison

by xjar (Pilgrim)
on Sep 13, 2000 at 19:44 UTC (#32285=perlquestion)
xjar has asked for the wisdom of the Perl Monks concerning the following question:

Hello all. I need to write a program to compare two HTML documents to determine if they are similar enough to be considered "the same". What I was thinking of doing is this (keep in mind, I'm a neophyte, so if my ideas are pretty poor, be kind):

  • Read each document into an array, line by line
  • Strip the newline off of each array element
  • "Concatenate" each array element into a string variable, so that in the end, each variable will hold an entire document
  • Take a substr() of each variable, starting, say, 150 characters in and running for 100 characters. If the two substrings are the same, then the documents are treated as the same.

    Now, I'm not sure how efficient this will be, especially with the swapping from array to variable. Can anyone provide me with some ideas, or even (hehe) a module that can help with this?

    Much thanks, xjar
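
    For readers following along, the plan above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names flatten and same_window are made up for the example, the documents are passed in as strings rather than read from files, and the sample pages are hypothetical. It also shows the weak spot of the approach: two pages that share a template can agree inside the sampled window and still differ elsewhere.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collapse a document into one string by removing line endings.
# (Hypothetical helper; a file could instead be slurped in one read.)
sub flatten {
    my ($text) = @_;
    $text =~ s/\r?\n//g;
    return $text;
}

# Compare a fixed window of both documents.
# Defaults follow the question: 100 characters starting 150 characters in.
sub same_window {
    my ($doc_a, $doc_b, $offset, $length) = @_;
    $offset = 150 unless defined $offset;
    $length = 100 unless defined $length;
    my ($flat_a, $flat_b) = (flatten($doc_a), flatten($doc_b));
    # A document shorter than the offset has no window to compare.
    return 0 if length($flat_a) <= $offset || length($flat_b) <= $offset;
    return substr($flat_a, $offset, $length) eq substr($flat_b, $offset, $length) ? 1 : 0;
}

# Hypothetical sample pages: same template, different endings.
my $head = "<html>\n<head><title>Demo</title></head>\n<body>\n"
         . ("<p>shared boilerplate</p>\n" x 20);
my $page_a = $head . "<p>first ending</p></body></html>\n";
my $page_b = $head . "<p>a completely different ending</p></body></html>\n";

print same_window($page_a, $page_b) ? "window matches\n" : "window differs\n";
print flatten($page_a) eq flatten($page_b) ? "documents identical\n" : "documents differ\n";
```

    Here the sampled window falls entirely inside the shared template, so the window check reports a match even though the full documents differ.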

    Re: HTML Document Comparison
    by merlyn (Sage) on Sep 13, 2000 at 19:49 UTC
        how about this:
        #!/usr/bin/perl -w
        use strict;
        use Algorithm::Diff qw(diff LCS);
        use HTML::TokeParser;
        use LWP::Simple;

        sub tokenize_url {
            my $url = shift;
            my $content = get $url or die $!;
            my $p = new HTML::TokeParser(\$content);
            my (@tokens, $token);
            push @tokens, $token while (defined ($token = $p->get_token));
            \@tokens;
        }

        my @content = map { tokenize_url($_) } qw{
            http://perlmonks.org/index.pl?node_id=32285
            http://perlmonks.org/index.pl?node_id=32286
        };

        # hash tokens based on their text content
        sub hash_token {$_[0][$_[0][0] eq 'T' ? 1 : -1]}

        my @diffs = diff $content[0], $content[1], \&hash_token;
        my @LCS = LCS $content[0], $content[1], \&hash_token;

        my $largest = 0;
        for my $hunk (@diffs) {
            my (@deletions, @additions);
            for (@$hunk) {
                push @deletions, $_ if $_->[0] eq '-';
                push @additions, $_ if $_->[0] eq '+';
            }
            my $size = @deletions > @additions ? @deletions : @additions;
            $largest = $size if $size > $largest;
        }

        print scalar(@{$content[0]}), " line", (@{$content[0]} == 1 ? '' : 's'), " in original", $/;
        print scalar(@{$content[1]}), " line", (@{$content[1]} == 1 ? '' : 's'), " in revision", $/;
        print scalar(@diffs), " hunk", (@diffs == 1 ? '' : 's'), " differ", $/;
        print $largest, " line", ($largest == 1 ? '' : 's'), " in largest hunk", $/;
        printf "Revision %0.2f%% similar to original$/", 100 * @LCS / @{$content[0]};

        updated 2001-Aug-01: small code changes; renamed from "RE: Re: HTML Document Comparison"

          Way cool! Now I can't look too hard at that if I'm going to reimplement it for the column, but way cool!

          Regarding the original poster's question, can you get some quantification of "how much" of the file is changed, like 0 to 100%?

          -- Randal L. Schwartz, Perl hacker

    Re: HTML Document Comparison
    by xjar (Pilgrim) on Sep 13, 2000 at 22:00 UTC
      Thank you both, merlyn and mdillon, for the help... heh, this is one of the reasons I love PerlMonks so much!

      With some minor modification, it looks like mdillon's code might be just what I was looking for, and it seems far more efficient than what I proposed.

      much thanks, xjar

    Re: HTML Document Comparison
    by cbraga (Pilgrim) on Sep 14, 2000 at 01:25 UTC
      How about this:

      1. Read the document and strip off everything except the html tags, including all newlines;
      2. Take the MD5 hash of the tag structure;
      3. Compare the MD5s of the documents to determine whether they have the same structure, or are derived from the same template with different text.

      Advantages:

      * Accounts for markup changes, so differences in the text are not significant;
      * If you have a lot of documents, MD5s are easy and quick to compare, as opposed to whole documents.

      Disadvantages:

      * Accounts for markup changes, so differences in the text are not significant;
      * MD5s are very strict, so there's no telling between small and big differences.

      Actually this isn't my idea, it was used in some web survey to count the number of unique sites on the net.
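
      A minimal sketch of the three steps above, using only the core Digest::MD5 module. Several things here are assumptions rather than part of the original suggestion: the helper names tag_skeleton and structure_digest are invented for the example, attributes are dropped along with the text so that only tag names are hashed, and tags are pulled out with a crude regex that will misfire on a '>' inside an attribute value or on tag-like text inside comments and scripts. A real parser such as HTML::TokeParser would be the sturdier choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Reduce a document to its sequence of tag names, lowercased.
# Text, newlines and attributes are all discarded (an assumption of this sketch).
sub tag_skeleton {
    my ($html) = @_;
    # Crude extraction: capture the (optionally closing) tag name of each tag.
    my @tags = $html =~ m{<\s*(/?[A-Za-z][A-Za-z0-9]*)[^>]*>}g;
    return join ' ', map { lc } @tags;
}

# MD5 of the tag skeleton, as a hex string that is cheap to store and compare.
sub structure_digest {
    my ($html) = @_;
    return md5_hex(tag_skeleton($html));
}

# Hypothetical sample pages.
my $page_a = "<html><body>\n<h1>News</h1>\n<p class=\"x\">First story</p>\n</body></html>";
my $page_b = "<HTML><BODY><H1>Sports</H1><P>Other text entirely</P></BODY></HTML>";
my $page_c = "<html><body><h1>News</h1><p>First story</p><p>Extra</p></body></html>";

for my $pair ([ 'a vs b', $page_a, $page_b ], [ 'a vs c', $page_a, $page_c ]) {
    my ($label, $x, $y) = @$pair;
    my $same = structure_digest($x) eq structure_digest($y);
    print "$label: ", ($same ? "same" : "different"), " structure\n";
}
```

      Pages a and b differ in every word of text but hash the same, while page c has one extra paragraph and hashes differently, with no indication of how small that difference is. That is the strictness listed under the disadvantages.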

        Yah NetCraft. Which is a great idea, EXCEPT that the changes may be text body within a template. They were looking for templated documents specifically.

        Still, a cool idea worthy of the ++ I stuck on it =)

        --
        $you = new YOU;
        honk() if $you->love(perl)

    Re: HTML Document Comparison
    by moen (Hermit) on Sep 14, 2000 at 01:37 UTC
      k, newbie answer ahead. Something like this gave me the proper answer 0||1 for equality using String::Approx. Of course it does no validation against proper HTML or anything, it's just a plain text match.
      use String::Approx 'amatch';
      $match = amatch(@txt1, @txt2);
      where @txt1 is the original document and @txt2 is the document being compared. $match will give you 0||1 for the match.
        Well, better would probably be String::Similarity (which I found out about only recently, and right here in the Monastery), which gives a value between 0 and 1. In fact, for the original application, that might be the easiest. {grin}

        -- Randal L. Schwartz, Perl hacker

    Re: HTML Document Comparison
    by planetscape (Canon) on Mar 22, 2008 at 21:07 UTC
    Re: HTML Document Comparison
    by ww (Bishop) on Mar 22, 2008 at 21:47 UTC

      This is a little late for OP, probably, but since planetscape has cross-referenced this thread with a more recent one, what follows may still have some value for future readers

      ..."to determine if they are similar enough to be considered 'the same'"

      There are some great answers, above, but note that some of them touch on what seems to me to be the threshold problem. To be explicit:

      What are the criteria for sameness? For one example, are two pages "to be considered the same" if one presents a given data set as a piechart and another as a barchart? ...in tabular fashion rather than as a block of lines?

      More generally, are we trying to see if:

      1. the two documents render the same content with mere typographic differences occurring solely because of variant markup? (and what makes those differences "mere?")
      2. the content is different, but the general appearance (layout) is the same or similar? (and for what value of "similar?")

      If you sample only segments of the comparatives, how much risk of a false positive are you willing to accept?

      Much as I admire the posts above, I think you need to answer these (and similar) questions for your project before adopting a method.
