PerlMonks  

HTML tidy, using XML::LibXML

by merlyn (Sage)
on Jan 04, 2003 at 16:38 UTC ( #224272=snippet )

Description: Turn messy HTML into standards-compliant HTML with every close tag written out explicitly. Reads STDIN or a list of files, and writes the result to STDOUT.
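#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
# Slurp STDIN (or the named files), parse as HTML, and print the serialized tree.
print XML::LibXML->new->parse_html_string(join "", <>)->toStringHTML;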
Re: HTML tidy, using XML::LibXML
by Matts (Deacon) on Jan 04, 2003 at 17:10 UTC
    If you've got sloppy HTML that needs tidying up and XML::LibXML barfs on it, you can change that one-liner to the following three-liner:
    my $parser = XML::LibXML->new();
    $parser->recover(1); # Set recovery on error flag
    print $parser->parse_html_string(join "", <>)->toStringHTML;
    XML::LibXML can also parse SGML. Most people don't know that (probably sucky docs ;-)

      Don't know if it's accurate, but the docs say that won't work:

      This switch (recover) will only work with XML data rather than HTML data.

      (Update, July 2010: this does work and has for at least a few years -- don't know if it was always the case; I should have tested back when I initially responded.)
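For anyone who wants to try the recovering variant from this thread on a known input, here is a minimal sketch. The sub name `tidy_html` and the sample string are made up for illustration and are not part of the original snippet. It only uses `recover`, `parse_html_string` and `toStringHTML`, the same calls that appear above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper (name chosen for this example): parse a string of
# possibly sloppy HTML in recover mode and return the re-serialized HTML.
sub tidy_html {
    my ($html) = @_;
    my $parser = XML::LibXML->new();
    $parser->recover(1);    # keep going after parse errors instead of dying
    my $doc = $parser->parse_html_string($html);
    return $doc->toStringHTML;
}

# Assumed sample input: paragraphs and inline tags left unclosed.
my $messy = '<p>one<p>two<br><b>bold';
print tidy_html($messy);
```

To get the original STDIN-or-files behaviour back, swap the last two lines for `print tidy_html(join "", <>);`. In recover mode the parser may still report problems on STDERR while producing output.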
