Web scraping toolkit?

by mzedeler (Pilgrim)
on Jan 26, 2012 at 15:41 UTC
mzedeler has asked for the wisdom of the Perl Monks concerning the following question:

Hi fellow Perl monks.

I need to organize the development of some 50+ small web scrapers for a similar number of pages on the Internet. The scrapers parse and extract data of similar structure across the different data sources.

So far, a few scripts have been written using WWW::Mechanize, HTML::TreeBuilder::XPath or HTML::TokeParser. This has worked fairly well, but I can see that there is a lot of boilerplate code across the scripts that could be reused. Also, I know that in some respects we need a toolkit that doesn't give us too many ways to solve the same problem, so we can somewhat standardize the code.

I took a look at Scrappy, but the fact that it uses Web::Scraper, which in turn seems to be only partly documented, has somewhat put me off.

Does anyone have any recommendations wrt. good web scraping toolkits?

Regards,

Michael.

Re: Web scraping toolkit?
by Anonymous Monk on Jan 26, 2012 at 15:44 UTC

    Does anyone have any recommendations wrt. good web scraping toolkits?

    IMHO it doesn't get better than that :)

    Ok, maybe Web::Magic

    but I say, if you've got 50 programs for 50 sites, you should be building on that experience to write your own Mechanize subclass (or more) to minimize the boilerplate

    FWIW, I embrace the boilerplate

      Thanks for the suggestion. I've started working with some reusable components that can handle repeated tasks and it's definitely part of the solution.
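
The Mechanize-subclass idea above could look roughly like the following. This is only a minimal sketch, assuming WWW::Mechanize and HTML::TreeBuilder::XPath are installed. The package names (My::Scraper::Base, My::Scraper::ExampleSite), the method names (fields, extract, scrape) and the XPath expressions are all hypothetical, made up for illustration. The shared fetch-and-parse code lives in the base class, and each site only declares which XPath yields which field.

```perl
use strict;
use warnings;

# Hypothetical base class: holds the boilerplate shared by all scrapers.
package My::Scraper::Base;
use parent 'WWW::Mechanize';
use HTML::TreeBuilder::XPath;

# Each site subclass overrides this to return { field_name => xpath }.
sub fields { die "subclass must implement fields()\n" }

# Parse an HTML string and return a hashref of field => extracted text.
sub extract {
    my ($self, $html) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
    my $spec = $self->fields;
    my %record = map { $_ => scalar $tree->findvalue($spec->{$_}) } keys %$spec;
    $tree->delete;    # free the parse tree explicitly
    return \%record;
}

# Fetch a URL with the inherited Mechanize get(), then extract the fields.
sub scrape {
    my ($self, $url) = @_;
    $self->get($url);
    return $self->extract($self->content);
}

# Hypothetical per-site scraper: nothing but the field definitions.
package My::Scraper::ExampleSite;
use parent -norequire, 'My::Scraper::Base';

sub fields {
    return {
        title => '//h1',
        price => '//span[@class="price"]',
    };
}

package main;

# Demonstrate extraction on an inline document, so no network is needed.
my $scraper = My::Scraper::ExampleSite->new;
my $record  = $scraper->extract(
    '<html><body><h1>Widget</h1><span class="price">9.95</span></body></html>'
);
print "$_: $record->{$_}\n" for sort keys %$record;
```

With this layout, a new site is one small package with a fields method, and a call such as `$scraper->scrape($url)` does the fetch and the extraction in one step. It also leaves only one way to declare an extraction, which may help with the standardization concern raised in the question.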
Re: Web scraping toolkit?
by Corion (Pope) on Jan 26, 2012 at 16:03 UTC

    Personally, I also wrote App::scrape to hide away my extraction library consisting of HTML::TreeBuilder::XPath and HTML::TokeParser.

    But that library only deals with convenient extraction from HTML, not with the navigation etc.

    I like the navigation and extraction API of WWW::Mechanize::Firefox, which is mostly a combination of the APIs of HTML::TreeBuilder::XPath and the API of WWW::Mechanize. Most likely, this sympathy is because I'm the author of that module.

    The best simple approach to boilerplate I've seen is Querylet, which is a source filter that describes DBI reports. Maybe you can reformulate your extractions in a language like it. I wrote (but have never used in production so far) a source-filter-less, pluggable version of Querylet at https://github.com/Corion/querylet/tree/pluggable, so if you dislike source filters but like the general language format, you can maybe reuse that parser instead.

      I think that App::scrape may turn out to be insufficient, not covering some edge cases that need handling. But again, that's my general worry, not having tried any of the scraping modules yet (the same goes for Web::Scraper and Scrappy).

      WWW::Mechanize::Firefox looks very promising, and implementing the few extra features that Scrappy has (logging and such) shouldn't be a problem. The real drawback lies in having to rely on Firefox (or some similar component) in development and production.

      I'll go back to the drawing board and see what to do. Thanks for the pointers.
