Beefy Boxes and Bandwidth Generously Provided by pair Networks Bob
Problems? Is your data what you think it is?
 
PerlMonks  

Re: Web scraping toolkit?

by Corion (Pope)
on Jan 26, 2012 at 16:03 UTC ( #950146=note: print w/ replies, xml ) Need Help??


in reply to Web scraping toolkit?

Personally, I also wrote App::scrape to hide away my extraction library consisting of HTML::TreeBuilder::XPath and HTML::TokeParser.

But that library only deals with convenient extraction from HTML, not with the navigation etc.

I like the navigation and extraction API of WWW::Mechanize::Firefox, which is mostly a combination of the APIs of HTML::TreeBuilder::XPath and the API of WWW::Mechanize. Most likely, this sympathy is because I'm the author of that module.

The best approach to a simplicistic boilerplate approach I've seen is Querylet, which is a source filter that describes DBI reports. Maybe you can reformulate your extractions in a language like it. I wrote (but never used in production so far) a source-filter-less, pluggable version of Querylet at https://github.com/Corion/querylet/tree/pluggable, so if you dislike source filters but like the general language format, you can maybe reuse that parser instead.


Comment on Re: Web scraping toolkit?
Re^2: Web scraping toolkit?
by mzedeler (Pilgrim) on Jan 27, 2012 at 08:44 UTC
    I think that App::scrape may turn out to be insufficient, not covering some edge cases that needs handling. But again - thats my general worry, not having tried any of the scraping modules yet (the same goes for Web::Scraper and Scrappy).

    WWW::Mechanize::Firefox looks very promising, and implementing the few extra features that Scrapie has (logging and such) shouldn't be a problem. The real drawback lies in having to rely on firefox (or some similar component) in development and production.

    I'll go back to the drawing board and see what to do. Thanks for the pointers.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://950146]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (5)
As of 2014-04-20 02:41 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (485 votes), past polls