PerlMonks  

Comparison of HTML documents with Perl

by sutch (Curate)
on Feb 24, 2005 at 20:50 UTC ( [id://434252]=perlquestion )

sutch has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking for a Perl solution that produces an HTML document illustrating the differences between two versions of an HTML document. It needs to accept two HTML documents and output a new HTML document that shows which sections were removed, added, and changed. I believe it must understand HTML so that it doesn't highlight changed tags as if they were text, but still somehow shows that formatting has changed.

I've found HTML::Diff, but the documentation is skimpy, and it only seems to provide the backend functionality: it returns structures that describe what has changed.

Are there any other options out there that are Perl based?

TIA

Replies are listed 'Best First'.
Re: Comparison of HTML documents with Perl
by talexb (Chancellor) on Feb 24, 2005 at 21:29 UTC

    I've had good luck with HTML::Parser, but I think what you're asking for could end up being infinitely complicated.

    You want to end up doing a high level compare, not a line by line or word by word compare. If you have to handle anyone's HTML, that could be impossible. If you're trying to version your own HTML, that will probably be easier -- you can focus on just a few tags.

    I think I'd start by comparing the structure of the document for changes, and see where a paragraph has been added or a table row has been deleted. From there, I'd look at the text within the parts whose structure has not changed. Boy, that's a really interesting project. Let us know how it turns out.
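One rough way to sketch the "structure first, then text" idea above. This is an illustration and not talexb's code: it assumes Algorithm::Diff is installed from CPAN, and the helper names tokens_of and html_changes are made up for the example. It flattens each document into a list of tags and text runs with HTML::TokeParser, then reports the items that were added, removed, or changed. Attributes are ignored, so a change in formatting that only touches attributes will not show up.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;           # part of the HTML-Parser distribution
use Algorithm::Diff qw(sdiff);  # CPAN module, assumed to be installed

# Flatten a document into a list of items: one entry per start tag,
# end tag, or non-blank run of text. Attributes are deliberately dropped.
sub tokens_of {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "cannot parse document";
    my @items;
    while (my $t = $p->get_token) {
        my $type = $t->[0];
        if    ($type eq 'S') { push @items, "<$t->[1]>" }
        elsif ($type eq 'E') { push @items, "</$t->[1]>" }
        elsif ($type eq 'T') {
            (my $text = $t->[1]) =~ s/\s+/ /g;   # collapse whitespace
            $text =~ s/^ | $//g;                 # trim both ends
            push @items, $text if length $text;
        }
    }
    return \@items;
}

# Compare two documents item by item and describe what differs.
sub html_changes {
    my ($old, $new) = @_;
    my @changes;
    for my $hunk (sdiff(tokens_of($old), tokens_of($new))) {
        my ($flag, $was, $now) = @$hunk;
        next if $flag eq 'u';                    # unchanged item
        push @changes, $flag eq '+' ? "added: $now"
                     : $flag eq '-' ? "removed: $was"
                     :                "changed: $was => $now";
    }
    return @changes;
}

my $old = '<p>Hello world</p><table><tr><td>a</td></tr><tr><td>b</td></tr></table>';
my $new = '<p>Hello there</p><table><tr><td>a</td></tr></table>';
print "$_\n" for html_changes($old, $new);
```

Turning that list into a marked-up HTML document (for example by wrapping added and removed items in ins and del elements) would be the next step, and is where arbitrary real-world HTML is likely to get complicated.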

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: Comparison of HTML documents with Perl
by ww (Archbishop) on Feb 25, 2005 at 01:25 UTC
    If the critical issue is "content-change", it might be simpler to strip all the HTML and compare the two resulting text files. Then, if I understand your intent, HTML::Diff might make it easy to finish the job when the text (content) varies. There are also alternatives in the HTML-Parser family, including http://search.cpan.org/dist/HTML-Parser/lib/HTML/TokeParser.pm
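A minimal sketch of the strip-and-compare approach, using HTML::TokeParser from the link above. The names plain_text and content_changed are hypothetical, and this is only one way to do it. Joining the text runs with a space means a word split by inline markup (such as Hel<b>lo</b>) becomes two words, and the contents of script and style elements are not filtered out.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;                    # part of the HTML-Parser distribution
use HTML::Entities qw(decode_entities);  # same distribution

# Return the text content of a document with all markup removed,
# entities decoded, and whitespace collapsed.
sub plain_text {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "cannot parse document";
    my @chunks;
    while (my $t = $p->get_token) {
        # 'T' tokens carry the raw text in element 1
        push @chunks, decode_entities($t->[1]) if $t->[0] eq 'T';
    }
    my $text = join ' ', @chunks;
    $text =~ s/\s+/ /g;    # collapse runs of whitespace
    $text =~ s/^ | $//g;   # trim both ends
    return $text;
}

# True when the two documents differ in text content, ignoring markup.
sub content_changed {
    my ($old, $new) = @_;
    return plain_text($old) ne plain_text($new);
}

print content_changed('<p>Hello <b>world</b></p>', '<div>Hello world</div>')
    ? "content changed\n"
    : "markup changed only (or nothing changed)\n";
```

Only when content_changed returns true would the two documents need the more expensive tag-aware comparison.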
Re: Comparison of HTML documents with Perl
by zby (Vicar) on Feb 25, 2005 at 23:36 UTC
    You should look at my Yet another HTML diff. I believe it is much better than HTML::Diff, which uses regexps to parse HTML and so works only for some pages. The explanation is there, so I won't duplicate it here, but I should add that I use it daily in Active Bookmarks to check for updates to interesting sites.

Approved by Arunbear