
Re: Web scraping toolkit?

by Corion (Pope)
on Jan 26, 2012 at 16:03 UTC (#950146)

in reply to Web scraping toolkit?

Personally, I also wrote App::scrape to hide away my extraction library consisting of HTML::TreeBuilder::XPath and HTML::TokeParser.

But that library only deals with convenient extraction from HTML, not with the navigation etc.
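To illustrate what "extraction only" means here, this is roughly what using HTML::TreeBuilder::XPath on its own looks like. It is a minimal sketch: the HTML snippet and the XPath expression are made up for the example, and it does not show App::scrape's own interface.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up sample document; in practice this would be a page you fetched
# yourself, since the library does no navigation.
my $html = <<'HTML';
<html><body>
  <ul>
    <li><a href="/node/1">First</a></li>
    <li><a href="/node/2">Second</a></li>
  </ul>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# Collect the text and target of every link that sits inside a list item.
my @links = map { +{ text => $_->as_text, href => $_->attr('href') } }
            $tree->findnodes('//li/a');

printf "%s -> %s\n", $_->{text}, $_->{href} for @links;

# HTML::TreeBuilder trees should be freed explicitly once you are done.
$tree->delete;
```

Fetching the page, following links and submitting forms would have to come from something else, for example WWW::Mechanize.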

I like the navigation and extraction API of WWW::Mechanize::Firefox, which is mostly a combination of the APIs of HTML::TreeBuilder::XPath and WWW::Mechanize. Most likely, I'm biased here because I'm the author of that module.

The best simplistic boilerplate approach I've seen is Querylet, which is a source filter that describes DBI reports. Maybe you can reformulate your extractions in a language like it. I wrote (but have never used in production so far) a source-filter-less, pluggable version of Querylet at, so if you dislike source filters but like the general language format, you can maybe reuse that parser instead.

Re^2: Web scraping toolkit?
by mzedeler (Pilgrim) on Jan 27, 2012 at 08:44 UTC
    I think that App::scrape may turn out to be insufficient, not covering some edge cases that need handling. But again, that's my general worry, since I haven't tried any of the scraping modules yet (the same goes for Web::Scraper and Scrappy).

    WWW::Mechanize::Firefox looks very promising, and implementing the few extra features that Scrapie has (logging and such) shouldn't be a problem. The real drawback lies in having to rely on Firefox (or some similar component) in both development and production.

    I'll go back to the drawing board and see what to do. Thanks for the pointers.
