
Re^2: extracting data from HTML

by Jurassic Monk (Acolyte)
on Jun 03, 2012 at 12:31 UTC

in reply to Re: extracting data from HTML
in thread extracting data from HTML


The point is to get it into something I can handle with XPath, and do some 'foreach' loops if needed.

and yes, I've read most of O'Reilly:

  • XML
  • Perl & XML
  • XML Schema
  • XSLT
  • XSLT cookbook

And that is the reason I turn to the monastery, for the answers are not to be found in those scrolls.

Re^3: extracting data from HTML
by bitingduck (Chaplain) on Jun 04, 2012 at 04:00 UTC

    Don't look for one general module that will solve all your HTML-to-data problems. Look at the page or pages you want to extract data from, and figure out which modules are best for those particular cases. In my experience (which is less than most others' here), it's not worth the trouble to hunt for something that will go straight from HTML to appropriately structured XML. Whoever generated the page had some database model and spewed it into some template they invented, probably with no thought whatsoever to making it easy to turn back into data. Or they didn't even do things consistently, making your problem of inverting it even worse.
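    If the pages are regular enough, one page-specific approach that still gives you XPath and 'foreach' is HTML::TreeBuilder::XPath from CPAN, which layers XPath queries over a tolerant HTML parse. A minimal sketch; the markup and class names here are invented for illustration, not taken from any real page:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;   # CPAN module, not core

# Hypothetical sample markup standing in for a scraped page.
my $html = <<'HTML';
<div class="book">
  <span class="title">Programming Perl</span>
  <span class="author">Larry Wall</span>
</div>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# XPath plus a 'foreach', as asked for in the thread.
foreach my $book ($tree->findnodes('//div[@class="book"]')) {
    my $title  = $book->findvalue('.//span[@class="title"]');
    my $author = $book->findvalue('.//span[@class="author"]');
    print "$title by $author\n";
}

$tree->delete;   # free the parse tree
```

    The point stands, though: the XPath expressions end up specific to the one site you're scraping, so there's no avoiding looking at the actual markup first.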

    If you have access to a lot of O'Reilly stuff, don't look at the general books. Look at a practical one--I started HTML scraping with recipes out of Spidering Hacks and still refer back to it occasionally.

    Here's a recent example: I had a bunch of pages on a website and wanted to copy the book metadata from all of them into XML, so I could generate a catalog from the XML. The catch is that the pages were all hand-coded. They did a pretty good job of using CSS to identify the relevant parts, but there were still inconsistencies, and a few of the older pages were so out of whack that they didn't get processed at all.

    If you look at the code, it's pretty specific to the pages I was scraping, so it's ugly in all sorts of ways. It could also be made somewhat simpler if I needed to do it a bunch more times: it's a bit repetitive in pulling out the labeled items, so those could become a loop through an array of names, perhaps with flags in the array for special treatment. There are also extraneous modules called in; the original pages were inconsistent about odd characters and entities, and that was one of the bigger headaches. Note how I find the pieces I want: I know how they're named, so I just do a "look down" to find them, and then process contents from there. Note also that I use XML::Writer to generate the XML, rather than trying to do it myself.
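    A minimal sketch of that look_down-plus-XML::Writer pattern, assuming HTML::TreeBuilder and XML::Writer from CPAN; the class names ('bookinfo', 'title', 'author') and sample markup are made up for illustration, not the actual pages described above:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN, not core
use XML::Writer;         # CPAN, not core

# Hypothetical markup standing in for one hand-coded page.
my $html = <<'HTML';
<div class="bookinfo">
  <h2 class="title">Spidering Hacks</h2>
  <p class="author">Kevin Hemenway</p>
</div>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my $writer = XML::Writer->new(
    OUTPUT      => \*STDOUT,
    DATA_MODE   => 1,        # newlines between elements
    DATA_INDENT => 2,        # indent nested elements
);
$writer->startTag('catalog');

# look_down matches elements by attribute; here, by CSS class.
for my $info ($tree->look_down(class => 'bookinfo')) {
    $writer->startTag('book');
    for my $field (qw(title author)) {
        my $node = $info->look_down(class => $field);
        $writer->dataElement($field,
            $node ? $node->as_trimmed_text : '');
    }
    $writer->endTag('book');
}

$writer->endTag('catalog');
$writer->end;
$tree->delete;
```

    Letting XML::Writer handle the tags and escaping, as the post suggests, avoids a whole class of malformed-output bugs compared to printing XML by hand.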
