Parsing HTML Documents

by itsscott (Acolyte)
on Sep 26, 2011 at 17:31 UTC (#927906=perlquestion)
itsscott has asked for the wisdom of the Perl Monks concerning the following question:

Hello wise monks. I know this kind of question has been asked before and has drawn many answers, and HTML::TokeParser is a wonderful tool for most of my needs.

I have some more complicated things that I want to do with pages I have fetched and parsed with said module. I would like to use XPath to work with an HTML page, as I do with the XML reports I create and read. The problem I am having is twofold. First, XML::LibXML is fine only if the page is strict and contains no unescaped ampersands. As we all know, this is not typical: almost all the pages I come across on client sites are broken to some small degree or contain a bare ampersand.

Second, I have seen a few modules that give access to the Mozilla/WebKit engines, but from what I have read they actually launch a browser instance, and this isn't what I want. Ideally I'd like to load a web page into a DOM object using a mainstream parser (Mozilla, WebKit, etc.) and then extract nodes and objects via XPath, again as I do with LibXML and XML files.

Thanks in advance for any insights or suggestions.

Re: Parsing HTML Documents
by Anonymous Monk on Sep 27, 2011 at 00:19 UTC

Node Type: perlquestion [id://927906]
Approved by davies