PerlMonks  

cmp two HTML fragments

by GrandFather (Saint)
on Feb 09, 2008 at 20:54 UTC ( [id://667203]=CUFP )

I had a need to compare two fragments of HTML to see if they were equivalent.

This snippet builds two HTML::TreeBuilder representations of the fragments, then recursively compares the contents of the fragments.

To use the snippet, call cmpHtml, passing the two fragments as strings:

print cmpHtml( '<p><font foo="bar" bar="1">bar 1</font></p>', '<p><font bar="2" foo="bar">bar 1</font></p>' );

Or, if you already have two HTML::Element objects that you want to compare, you can call cmpHtmlElt directly:

print cmpHtmlElt ($elt1, $elt2);

use HTML::TreeBuilder;

sub cmpHtml {
    my ($html1, $html2) = @_;
    my $root1 = HTML::TreeBuilder->new;
    my $root2 = HTML::TreeBuilder->new;

    $root1->parse_content ($html1);
    $root1->elementify ();
    $root2->parse_content ($html2);
    $root2->elementify ();
    return cmpHtmlElt ($root1, $root2);
}

sub cmpHtmlElt {
    my ($elt1, $elt2) = @_;

    # An undefined node sorts before a defined one
    my $cmp = defined $elt1 cmp defined $elt2;
    return $cmp if $cmp;
    return 0 unless defined $elt1;

    # Text nodes are plain strings; elements are references
    $cmp = ref $elt1 cmp ref $elt2;
    return $cmp if $cmp;
    return $elt1 cmp $elt2 unless ref $elt1;

    $cmp = $elt1->tag () cmp $elt2->tag ();
    return $cmp if $cmp;

    # Compare attributes by name, so their order doesn't matter
    my %attribs1 = $elt1->all_attr ();
    my %attribs2 = $elt2->all_attr ();

    $cmp = keys %attribs1 <=> keys %attribs2;
    return $cmp if $cmp;

    for my $key (keys %attribs1) {
        return 1 unless exists $attribs2{$key};
        next if $key =~ /^_/;    # skip HTML::Element's internal attributes
        $cmp = $attribs1{$key} cmp $attribs2{$key};
        return $cmp if $cmp;
    }

    # Compare children pairwise, in document order
    my @children1 = $elt1->content_list ();
    my @children2 = $elt2->content_list ();

    $cmp = @children1 <=> @children2;
    return $cmp if $cmp;

    for my $index (0 .. $#children1) {
        $cmp = cmpHtmlElt ($children1[$index], $children2[$index]);
        return $cmp if $cmp;
    }

    return 0;    # no difference found
}

Replies are listed 'Best First'.
Re: cmp two HTML fragments
by lodin (Hermit) on Feb 10, 2008 at 14:51 UTC

    Nice. Have you considered turning this into a module?

    Another way to do this is to use HTML::PrettyPrinter or somesuch and do a string-wise comparison. That way it's easier to find how the code differs (using string diff tools) if needed, but it's probably a lot slower.

    There's an (inherited) bug in your code. It leaks memory. You need to free the circular references in the tree by using the delete method:

    sub cmpHtml {
        ...
        my $cmp = cmpHtmlElt ($root1, $root2);
        $_->delete for $root1, $root2;
        return $cmp;
    }
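To illustrate why circular references need to be broken by hand, here is a minimal pure-Perl sketch. DemoNode is a hypothetical stand-in class, not part of HTML::Element; it only assumes that each child holds a strong reference back to its parent, and its delete method imitates the idea of the delete call above rather than its real implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a tree node (not HTML::Element itself):
# each child keeps a strong reference to its parent, forming a cycle.
package DemoNode;

my $live = 0;    # number of DemoNode objects not yet destroyed

sub new {
    my ($class, $parent) = @_;
    my $self = bless { parent => $parent, children => [] }, $class;
    push @{ $parent->{children} }, $self if $parent;
    $live++;
    return $self;
}

# Break the parent/child cycles so reference counting can reclaim the nodes.
sub delete {
    my ($self) = @_;
    $_->delete for @{ $self->{children} };
    %$self = ();
    return;
}

sub DESTROY { $live-- }
sub live    { return $live }

package main;

{
    my $root = DemoNode->new;
    DemoNode->new ($root);
}    # $root goes out of scope, but the cycle keeps both nodes alive
my $leaked = DemoNode->live;

{
    my $root = DemoNode->new;
    DemoNode->new ($root);
    $root->delete;
}    # cycles were broken first, so both new nodes are reclaimed here
my $after_delete = DemoNode->live;

print "live after leaky block: $leaked\n";
print "live after delete block: $after_delete\n";
```

The second block adds no further live objects, because the cycles were broken before the root went out of scope.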

    As an aside, I'd like to share this little trick:

    $cmp = EXPR; return $cmp if $cmp;
    which you make plenty of use of, can be replaced with
    { return EXPR || next }
    (assuming scalar context) though that may be a bit too obfuscated to use in public code. :-)
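A small sketch of the two forms side by side may help. It relies on a bare block counting as a loop that runs once, so `next` simply leaves the block. The cmp_lists helper is a hypothetical example, not code from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: compare two array refs of strings, cmp-style.
sub cmp_lists {
    my ($left, $right) = @_;

    # Conventional form: store the result, return it if it is non-zero.
    my $cmp = @$left <=> @$right;
    return $cmp if $cmp;

    for my $i (0 .. $#$left) {
        # Bare-block form: a non-zero result is returned at once;
        # otherwise `next` leaves the bare block and the for loop carries on.
        { return $left->[$i] cmp $right->[$i] || next }
    }

    return 0;    # the lists are equal
}

print cmp_lists ([qw(a b)], [qw(a c)]), "\n";
```

Note that `next` here exits only the innermost enclosing block, which is the bare block, not the surrounding for loop.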

    lodin

      As it happens, the code shown was pretty transient anyway. For the module test suite that I wrote the code for, I replaced it with:

      my $root1 = HTML::TreeBuilder->new ();
      my $root2 = HTML::TreeBuilder->new ();

      $root1->parse_content ($rendered)->elementify ()
          ->delete_ignorable_whitespace ();
      $root2->parse_content ($expected)->elementify ()
          ->delete_ignorable_whitespace ();
      is ($root1->as_HTML (undef, ' ', {}), $root2->as_HTML (undef, ' ', {}), $testName);

      in any case so that I'd get better diagnostics (I see the two HTML fragments when the test fails). However, with a little tweaking to give a traceback the original code would be even better in the test context because it would highlight the difference by reducing the clutter. That version might almost be worth generating a module for.
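One way the "traceback" idea might look is sketched below. To stay self-contained it walks a hypothetical stand-in structure, nested arrays of the form [ $tag, \%attrs, @children ], rather than real HTML::Element objects. The function name first_difference and the path notation are assumptions made for illustration. The same walk could presumably be driven by tag, all_attr and content_list instead.

```perl
use strict;
use warnings;

# Hypothetical node shape: [ $tag, \%attrs, @children ], where each child
# is either another node (array ref) or a plain text string.
# Returns undef when the trees match, else a description of the first
# difference found, prefixed with the path leading to it.
sub first_difference {
    my ($n1, $n2, $path) = @_;
    $path = '' unless defined $path;
    my $where = length $path ? $path : '(root)';

    # Text nodes, or a text node facing an element
    if (!ref $n1 || !ref $n2) {
        return "$where: element vs text" if ref $n1 || ref $n2;
        return undef if $n1 eq $n2;
        return sprintf '%s: text "%s" vs "%s"', $where, $n1, $n2;
    }

    my ($tag1, $attr1, @kids1) = @$n1;
    my ($tag2, $attr2, @kids2) = @$n2;
    return sprintf '%s: tag <%s> vs <%s>', $where, $tag1, $tag2
        if $tag1 ne $tag2;
    $path .= "/$tag1";

    # Compare attributes by name so their order does not matter
    my %names = map { $_ => 1 } keys %$attr1, keys %$attr2;
    for my $name (sort keys %names) {
        my ($v1, $v2) = ($attr1->{$name}, $attr2->{$name});
        return sprintf '%s@%s: present on one side only', $path, $name
            unless defined $v1 && defined $v2;
        return sprintf '%s@%s: "%s" vs "%s"', $path, $name, $v1, $v2
            if $v1 ne $v2;
    }

    return sprintf '%s: %d vs %d children', $path, scalar @kids1, scalar @kids2
        if @kids1 != @kids2;

    for my $i (0 .. $#kids1) {
        my $diff = first_difference ($kids1[$i], $kids2[$i], "$path\[$i]");
        return $diff if defined $diff;
    }

    return undef;    # no difference found
}

my $rendered = ['p', {}, ['font', {foo => 'bar', bar => '1'}, 'bar 1']];
my $expected = ['p', {}, ['font', {bar => '2', foo => 'bar'}, 'bar 1']];
my $diff = first_difference ($rendered, $expected);
print defined $diff ? "$diff\n" : "fragments match\n";
```

In a test, a string like this could serve as the diagnostic message, so a failure points at the differing node instead of dumping both fragments.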


      Perl is environmentally friendly - it saves trees
Re: cmp two HTML fragments
by planetscape (Chancellor) on Mar 22, 2008 at 21:09 UTC
