HTML (3.0, 4.0, etc.) is not a subset of XML, at least not until you get to the XHTML stage. XML is a restricted subset of SGML, and HTML is defined as an application of SGML; neither one is derived from the other. The main reason I bring up this point is that HTML is, by and large, not well-formed. I'm willing to bet most XML parsers will choke on a common HTML page, simply because most HTML pages aren't structured properly. A <P> tag without a corresponding </P> tag is probably among the most common offenses, and <IMG SRC="blah.gif"> has no slash terminator; neither of these is allowed in XML.
Granted, it's a moot point if you hand-craft the HTML going into your programs, but if you're analyzing other people's websites, assuming that they serve properly structured HTML is probably an unwise programming move, IMO.
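To make the point concrete, here is a minimal sketch of what "choking" looks like. It uses Python's standard-library XML parser purely as a stand-in for any strict XML parser (a Perl XML parser should reject the same input); the sample markup and the helper name is_well_formed are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Typical "tag soup": an unclosed <P> and an <IMG> without a slash terminator.
html_page = '<HTML><BODY><P>Hello<IMG SRC="blah.gif"></BODY></HTML>'

# The same content written as well-formed XHTML.
xhtml_page = '<html><body><p>Hello<img src="blah.gif"/></p></body></html>'


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Raised on mismatched or unclosed tags, among other errors.
        return False


print(is_well_formed(html_page))   # False
print(is_well_formed(xhtml_page))  # True
```

The first string fails as soon as the parser reaches </BODY> while <IMG> and <P> are still open, which is exactly the kind of page you should expect to meet in the wild.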
andre germain
"Wherever you go, there you are."
There are several ways to go from HTML to XML, so you can use XML tools with it:
- install XML::PYX (and HTML::TreeBuilder), then run pyxhtml file.html | pyxw > file.xml
- use tidy: just run tidy --output-xhtml yes file.html > file.xml. Note that you can get a Perl wrapper for it: sl-tidy.pl
Note that if you are only working with HTML, it might not be really useful to convert everything to XML, and you might want to use HTML::Parser instead.
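As a rough illustration of why an HTML-aware parser can be the simpler route, here is a sketch using Python's standard-library html.parser as an analogue of HTML::Parser (it is not the Perl module's API, just the same event-driven idea). The class name ImageCollector and the helper image_sources are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag encountered."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag == "img":
            for name, value in attrs:
                if name == "src":
                    self.sources.append(value)


def image_sources(markup):
    """Return the image sources found in possibly malformed HTML."""
    collector = ImageCollector()
    collector.feed(markup)
    collector.close()
    return collector.sources


# Unclosed <P> and unterminated <IMG>: no conversion to XML is needed.
print(image_sources('<P>Hello<IMG SRC="blah.gif">'))  # ['blah.gif']
```

The parser just reports tags as it sees them and never insists on matching close tags, so sloppy pages go through without a cleanup step.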