
Parsing HTML Documents

by itsscott (Sexton)
on Sep 26, 2011 at 17:31 UTC (#927906)
itsscott has asked for the wisdom of the Perl Monks concerning the following question:

Hello wise monks. I know this kind of question has been asked before and has drawn many answers, and HTML::TokeParser is a wonderful tool for most of my needs.

I have some more complicated things that I want to do with pages I have consumed and parsed with said module. I would like to use XPath to work with the HTML pages I have consumed, as I do with the XML reports I create and read. The problem I am having is twofold. First, XML::LibXML is fine only if the page contains no bare ampersands and is strictly well-formed. As we all know, this is not typical: almost all pages that I come across on client sites are broken to a small degree or use a bare ampersand.

Second, I have seen a few modules that give access to the Mozilla/WebKit engines, but from what I have read they actually launch a browser instance, and this isn't what I want. Ideally I'd like to consume a web page into a DOM object with a mainstream parser (Mozilla/WebKit etc.) and then extract nodes and objects via XPath, again as I do with LibXML and XML files.

Thanks in advance for any insight / suggestions
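
For illustration, here is a minimal sketch of one way the goal above might be approached. It is not from the original post. XML::LibXML also exposes libxml2's HTML parser through load_html, which is separate from the strict XML parser and can be asked to recover from malformed markup. The sample page, its URLs, and the variable names are hypothetical, and how well recovery works on a given real-world page is an assumption to check case by case.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample page: a bare ampersand in text and in an href,
# plus unclosed <p> and <li> elements.
my $html = <<'HTML';
<html><head><title>Fish & Chips</title></head>
<body>
<p>Menu
<ul><li><a href="/cod?size=l&fries=1">Cod</a>
<li><a href="/haddock">Haddock</a></ul>
</body></html>
HTML

# Parse with libxml2's HTML parser instead of the XML parser.
# recover => 2 asks the parser to recover from errors without reporting them.
my $dom = XML::LibXML->load_html(
    string            => $html,
    recover           => 2,
    suppress_errors   => 1,
    suppress_warnings => 1,
);

# The result is an ordinary XML::LibXML document, so XPath works as usual.
my $title = $dom->findvalue('//title');
my @hrefs = map { $_->value } $dom->findnodes('//a/@href');

print "Title: $title\n";
print "Link:  $_\n" for @hrefs;
```

No browser instance is launched here. If libxml2's recovery turns out to be too weak for a particular page, HTML::TreeBuilder::XPath is another CPAN module that offers XPath queries over a leniently parsed HTML tree.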

Replies are listed 'Best First'.
Re: Parsing HTML Documents
by Anonymous Monk on Sep 27, 2011 at 00:19 UTC
