The State of Web spidering in Perl

by digital_carver (Sexton)
on Sep 22, 2013 at 13:48 UTC (#1055183, perlmeditation)

Dear fellow monks,

I have been looking into the various web scraping frameworks in Perl, and have gathered the following ones bit by bit from various PerlMonks discussions and blog posts. I'm listing them here for posterity, but my main purpose is to get the community's feedback on their current status and on the best way to go about web spidering in modern Perl.

(Note that the comments are quick first impressions and may be wildly inaccurate; corrections welcome.)

Starting from:

  • Good old WWW::Mechanize and HTML::TreeBuilder (Mojo::UserAgent and Mojo::DOM seem to be basically equivalent, though I haven't tried them).
    • Comments: Gets the job done and gives you full control (edit the HTML before parsing if you want, get HTML dumps easily, etc.), but the code ends up quite verbose and full of boilerplate.
  • Scrappy
  • Gungho
  • YADA - just came across it, haven't used it yet.
  • Web::Scraper
    • Comments: This is what I'm using now. The DSL syntax is nice, though a bit under-documented, and I had to peek into the sources quite a bit to understand or customize many things.
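To give a feel for the Web::Scraper DSL mentioned in the last item, here is a minimal sketch. The sample markup, the `paragraphs` key, and the variable names are made up for illustration; only `scraper`, `process`, and `scrape` come from the module itself, and the XPath expression could equally be written as a CSS selector.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical sample page; a real spider would pass a URI object
# or an HTTP::Response to scrape() instead of a string.
my $html = <<'HTML';
<html><body>
<div id="blah"><p>one</p><p>two</p></div>
<div id="other"><p>three</p></div>
</body></html>
HTML

# Collect the text of every <p> directly under div#blah.
# The trailing [] asks for a list of matches rather than only the first.
my $paragraphs = scraper {
    process '//div[@id="blah"]/p', 'paragraphs[]' => 'TEXT';
};

my $result = $paragraphs->scrape($html);
print "$_\n" for @{ $result->{paragraphs} || [] };
```

The whole extraction rule fits in one `process` line, which is the appeal of the DSL; the cost is that anything beyond such rules tends to mean reading the module source.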
The main reason I'm making this post is that I seem to be stumbling upon good scraping frameworks randomly, so it's quite possible I'm missing some really good framework that Google just hasn't deigned to show me. So, I'd like to get the opinion of revered monks on this topic.

Re: The State of Web spidering in Perl
by brianski (Novice) on Sep 22, 2013 at 15:43 UTC
    Speaking of good and old, what about LWP instead of WWW::Mechanize, and HTML::Parser instead of HTML::TreeBuilder? I have a bunch of production code that uses them, and it's been working flawlessly (modulo the bugs we introduce ;) for over a decade...

      I'll give HTML::Parser a second look, thanks for the suggestion. How do you match something like //div[@id='blah']/p, though? Do you explicitly maintain state?
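One way to do that kind of match with HTML::Parser is indeed to keep explicit state, for example a stack of open elements. The following is a minimal sketch, not anyone's production code: the helper name `match_div_p` and the `%void` list are hypothetical, and it assumes reasonably well-formed HTML with explicit end tags (implicitly closed `<p>` elements are not handled).

```perl
use strict;
use warnings;
use HTML::Parser;

# Elements that never get an end tag, so they are kept off the stack.
my %void = map { $_ => 1 }
    qw(area base br col embed hr img input link meta source track wbr);

# Return the text of every <p> that is a direct child of <div id="$wanted_id">,
# roughly what //div[@id='...']/p would select.
sub match_div_p {
    my ($html, $wanted_id) = @_;
    my @stack;       # one [tagname, id] pair per currently open element
    my @results;     # accumulated text of each matching <p>
    my $p_depth;     # stack depth of the open matching <p>, undef outside one

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;
            return if $void{$tag};
            my $parent = $stack[-1];
            push @stack, [ $tag, defined $attr->{id} ? $attr->{id} : '' ];
            if (!defined $p_depth && $tag eq 'p' && $parent
                && $parent->[0] eq 'div' && $parent->[1] eq $wanted_id) {
                $p_depth = @stack;
                push @results, '';
            }
        }, 'tagname, attr' ],
        end_h => [ sub {
            my ($tag) = @_;
            # Ignore stray end tags that do not close the innermost element.
            return unless @stack && $stack[-1][0] eq $tag;
            undef $p_depth if defined $p_depth && @stack == $p_depth;
            pop @stack;
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            $results[-1] .= $text if defined $p_depth;
        }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;
    return @results;
}
```

Calling `match_div_p($html, 'blah')` returns a list of strings, one per matching paragraph. The bookkeeping above is the price of the event-driven API; a tree builder does it for you at the cost of holding the whole document in memory.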

      As for LWP vs Mech, LWP does work for my use case; I just prefer Mech for a few niceties like autocheck, auto-delegation of $mech->content() to $response->decoded_content(), cookie_jar defaulting to on, etc.
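To make those niceties concrete, here is a rough side-by-side sketch of fetching a page both ways. The function names `fetch_with_lwp` and `fetch_with_mech` are made up for illustration, and `autocheck => 1` is spelled out even though it is the default in recent WWW::Mechanize releases.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use WWW::Mechanize;

# Plain LWP: cookies are opt-in, errors must be checked by hand,
# and the decoded body has to be requested explicitly.
sub fetch_with_lwp {
    my ($url) = @_;
    my $ua  = LWP::UserAgent->new(cookie_jar => {});   # in-memory cookie jar
    my $res = $ua->get($url);
    die "GET $url failed: " . $res->status_line unless $res->is_success;
    return $res->decoded_content;
}

# Mech: it sets up a cookie jar by itself, autocheck makes a failed
# request die, and content() hands back the decoded body.
sub fetch_with_mech {
    my ($url) = @_;
    my $mech = WWW::Mechanize->new(autocheck => 1);
    $mech->get($url);
    return $mech->content;
}
```

Both functions die on a failed request and return the decoded page body on success; the Mech version simply needs less ceremony to get there.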

Re: The State of Web spidering in Perl
by Anonymous Monk on Sep 23, 2013 at 00:35 UTC
Re: The State of Web spidering in Perl ( WWW::WebKit controls Gtk3::WebKit )
by Anonymous Monk on Sep 28, 2013 at 23:05 UTC
