Re: Regular Expression Help



by ton (Friar)

on Sep 01, 2001 at 20:39 UTC


in reply to Regular Expression Help

I use XML::Parser for all my XML (and therefore HTML) parsing needs. Be warned that you need to have expat installed on your machine.

Good luck!

-Ton
-----
Be bloody, bold, and resolute; laugh to scorn
The power of man...


Replies are listed 'Best First'.
HTML and XML by Agermain (Scribe) on Sep 02, 2001 at 04:54 UTC
HTML (3.0, 4.0, etc.) is not a subset of XML, at least not until you get to the XHTML stage. XML is a subset of SGML, and HTML is an application of SGML. The main reason I bring up this point is that HTML is - by and large - not well-formed. I'm willing to bet most XML parsers will choke on a common HTML page, simply because most HTML pages aren't structured properly. A `<P>` tag without a corresponding `</P>` tag would probably be the second most common offense, not to mention that `<IMG SRC="blah.gif">` doesn't have a slash terminator; neither of these is smiled upon in XML. Granted, it's a moot point if you hand-craft the HTML code going into your programs, but if you're analyzing other websites, assuming that they have properly-structured HTML is probably an unwise programming move, IMO.

andre germain
"Wherever you go, there you are."
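
To make the point above concrete, here is a minimal sketch, assuming XML::Parser (and expat) is installed. The helper name `is_well_formed` and both sample strings are made up for illustration. XML::Parser dies on the first well-formedness error, so the call is wrapped in `eval`.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: returns 1 if $markup parses as well-formed XML, 0 otherwise.
# XML::Parser dies on the first well-formedness error, hence the eval.
sub is_well_formed {
    my ($markup) = @_;
    my $parser = XML::Parser->new;    # no handlers: we only care whether parsing succeeds
    return eval { $parser->parse($markup); 1 } ? 1 : 0;
}

# Typical "tag soup": unclosed <P> tags and an <IMG> without a slash terminator.
my $html = '<html><body><P>First paragraph<P>Second<IMG SRC="blah.gif"></body></html>';

# The same content written the XHTML way.
my $xhtml = '<html><body><p>First paragraph</p><p>Second<img src="blah.gif" /></p></body></html>';

printf "HTML-style markup:  %s\n", is_well_formed($html)  ? 'parsed' : 'rejected';
printf "XHTML-style markup: %s\n", is_well_formed($xhtml) ? 'parsed' : 'rejected';
```

If you do feed XML::Parser pages from other sites, some check like this (or a conversion step first) keeps a single sloppy page from killing the whole script.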
Re: HTML and XML by mirod (Canon) on Sep 02, 2001 at 11:46 UTC
There are several ways to go from HTML to XML, so that you can use XML tools with it: install XML::PYX (and HTML::TreeBuilder) and do `pyxhtml file.html | pyxw > file.xml`, or use tidy: just do `tidy --output-xhtml yes file.html > file.xml`. Note that you can get a Perl wrapper for tidy: sl-tidy.pl. Note also that if you are only working with HTML, it might not be really useful to convert everything to XML, and you might want to use HTML::Parser instead.
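
For the HTML::Parser route, a rough sketch might look like the following, assuming HTML::Parser version 3 is installed. The function name `image_sources` and the sample page are invented for the example; the original question may need different tags or attributes. With the version 3 API, tag and attribute names arrive lowercased, so `<IMG SRC=...>` and `<img src=...>` are handled the same way.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: collects the src attribute of every <img> tag,
# without requiring the page to be well-formed.
sub image_sources {
    my ($html) = @_;
    my @sources;
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tagname, $attr) = @_;
                push @sources, $attr->{src}
                    if $tagname eq 'img' && defined $attr->{src};
            },
            'tagname, attr',    # arguments passed to the handler above
        ],
    );
    $parser->parse($html);
    $parser->eof;               # tell the parser there is no more input
    return @sources;
}

# Unclosed <P> tags and no slash terminators; HTML::Parser does not mind.
my $page  = '<P>One<IMG SRC="blah.gif"><P>Two<img src="other.png">';
my @found = image_sources($page);
print "$_\n" for @found;
```

This skips the conversion to XML entirely, which is usually the simpler choice when the input is only ever HTML.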

