Comparison of HTML documents with Perl

sutch has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking for a Perl solution that displays an HTML document illustrating the differences between two versions of an HTML document. It needs to accept two HTML documents and output a new HTML document that shows which sections were removed, added, and changed. I believe it must understand HTML, so that it doesn't highlight changed tags as if they were content but still shows somehow that the formatting has changed.

I've found HTML::Diff, but the documentation is skimpy, and the module seems to provide only the backend: it returns structures describing what has changed, not a document that displays the changes.

Are there any other options out there that are Perl based?

TIA

Re: Comparison of HTML documents with Perl by talexb (Chancellor) on Feb 24, 2005 at 21:29 UTC
I've had good luck with HTML::Parser, but I think what you're asking for could end up being infinitely complicated. You want to end up doing a high-level compare, not a line-by-line or word-by-word compare. If you have to handle anyone's HTML, that could be impossible. If you're trying to version your own HTML, that will probably be easier: you can focus on just a few tags.

I think I'd start by comparing the structure of the two documents, to see where a paragraph has been added or a table row has been deleted. From there, I'd look at the text within the parts whose structure has not changed.

Boy, that's a really interesting project. Let us know how it turns out.

Alex / talexb / Toronto
"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
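A rough sketch of the "compare the structure first" idea, in case it helps. It assumes HTML::TokeParser (from the HTML-Parser distribution) and Algorithm::Diff are installed. The helper name tag_skeleton and the two sample documents are made up for illustration, and this covers only the first step, not a full solution.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use Algorithm::Diff qw(sdiff);

# Hypothetical helper: reduce a document to its sequence of start and
# end tags, ignoring text, comments and attributes.
sub tag_skeleton {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "cannot create parser";
    my @tags;
    while (my $token = $parser->get_token) {
        my ($type, $name) = @$token;
        push @tags, "<$name>"  if $type eq 'S';   # start tag
        push @tags, "</$name>" if $type eq 'E';   # end tag
    }
    return @tags;
}

# Made-up sample input: a paragraph was added and a table row removed.
my $old = '<p>One</p><table><tr><td>a</td></tr><tr><td>b</td></tr></table>';
my $new = '<p>One</p><p>Two</p><table><tr><td>a</td></tr></table>';

my @old_tags = tag_skeleton($old);
my @new_tags = tag_skeleton($new);

# sdiff returns [op, left, right] triples: 'u' unchanged, '+' added,
# '-' removed, 'c' changed. Only structural differences are printed.
for my $hunk (sdiff(\@old_tags, \@new_tags)) {
    my ($op, $left, $right) = @$hunk;
    next if $op eq 'u';
    print "$op $left $right\n";
}
```

Runs of unchanged tags tell you which regions kept their structure; those are the regions whose text is worth comparing in a second pass.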
Re: Comparison of HTML documents with Perl by ww (Archbishop) on Feb 25, 2005 at 01:25 UTC
If the critical issue is "content change", it might be simpler to strip all the HTML and compare the two resulting text files. Then, if I understand your intent, HTML::Diff might make it easy to finish the job once you know the text (content) varies? There are alternatives in the HTML-Parser family, including http://search.cpan.org/dist/HTML-Parser/lib/HTML/TokeParser.pm
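A minimal sketch of stripping the markup and comparing only the text, assuming HTML::TokeParser and Algorithm::Diff are available. The names visible_words and mark_changes and the [-...-] / {+...+} markers are invented for this example; real output would presumably wrap the changes in HTML instead.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use Algorithm::Diff qw(sdiff);

# Hypothetical helper: collect the words from all text tokens,
# dropping every tag. Entities are left undecoded, and no attempt
# is made to skip script or style content.
sub visible_words {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "cannot create parser";
    my @words;
    while (my $token = $parser->get_token) {
        next unless $token->[0] eq 'T';          # text tokens only
        push @words, split ' ', $token->[1];
    }
    return @words;
}

# Hypothetical helper: word-level diff of the text of two documents,
# returned as one string with removals and additions marked.
sub mark_changes {
    my ($old_html, $new_html) = @_;
    my @old = visible_words($old_html);
    my @new = visible_words($new_html);
    my @out;
    for my $hunk (sdiff(\@old, \@new)) {
        my ($op, $left, $right) = @$hunk;
        if    ($op eq 'u') { push @out, $left }
        elsif ($op eq '-') { push @out, "[-${left}-]" }
        elsif ($op eq '+') { push @out, "{+${right}+}" }
        else               { push @out, "[-${left}-]", "{+${right}+}" }
    }
    return join ' ', @out;
}

print mark_changes(
    '<p>The <b>quick</b> fox</p>',
    '<p>The <i>quick</i> brown fox</p>',
), "\n";
```

Note that the switch from <b> to <i> in the sample is invisible to this approach. That is exactly the formatting change the original question wants reported, so a text-only compare can be at most one part of the answer.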
Re: Comparison of HTML documents with Perl by zby (Vicar) on Feb 25, 2005 at 23:36 UTC
You should look at my Yet another HTML diff. I believe it is much better than HTML::Diff, which uses regexps to parse HTML and so works only for some pages. The explanation is there, so I won't duplicate it here, but I should add that I use it daily in Active Bookmarks to check for updates to interesting sites.

