Dear fellow monks,

I have been looking into the web scraping frameworks available in Perl, and have gathered the following bit by bit from various Perlmonks discussions and blog posts. I'm listing them here for posterity, but my main purpose is to get the community's feedback on the current status of these and the best way to go about web spidering in modern Perl.

(Note that the comments are quick first impressions and may be wildly inaccurate; corrections welcome.)

Starting from:

The main reason I'm making this post is that I seem to be stumbling upon good scraping frameworks at random, so it's quite possible I'm missing some really good framework that Google just hasn't deigned to show me. So I'd like to get the opinion of the revered monks on this topic.

Re: The State of Web spidering in Perl
by brianski (Novice) on Sep 22, 2013 at 15:43 UTC
    Speaking of good and old, what about LWP instead of WWW::Mechanize, and HTML::Parser instead of HTML::TreeBuilder? I have a bunch of production code that uses them, and they've been working flawlessly (modulo the bugs we introduce ;) for over a decade...

      I'll give HTML::Parser a second look, thanks for the suggestion. How do you match something like //div[@id='blah']/p, though? Do you explicitly maintain state?
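
      For reference, a minimal sketch of the state-keeping approach with HTML::Parser (the sample markup and the 'blah' id are just illustrative; strictly speaking this matches //div[@id='blah']//p, i.e. any <p> inside the div, and limiting it to direct children would take a little more bookkeeping):

          use strict;
          use warnings;
          use HTML::Parser;

          my $html = '<div id="blah"><p>first</p><p>second</p></div><p>elsewhere</p>';

          my $div_depth = 0;   # >0 while inside <div id="blah"> (counts nested divs too)
          my $in_p      = 0;   # true while inside a <p> under that div
          my $buf       = '';
          my @found;

          my $parser = HTML::Parser->new(
              api_version => 3,
              start_h => [ sub {
                  my ($tag, $attr) = @_;
                  if ($tag eq 'div') {
                      # enter the target div, or go one level deeper inside it
                      $div_depth++ if $div_depth || ($attr->{id} // '') eq 'blah';
                  }
                  elsif ($tag eq 'p' && $div_depth) {
                      $in_p = 1;
                      $buf  = '';
                  }
              }, 'tagname, attr' ],
              end_h => [ sub {
                  my ($tag) = @_;
                  if ($tag eq 'div' && $div_depth) {
                      $div_depth--;
                  }
                  elsif ($tag eq 'p' && $in_p) {
                      push @found, $buf;
                      $in_p = 0;
                  }
              }, 'tagname' ],
              text_h => [ sub { $buf .= $_[0] if $in_p }, 'dtext' ],
          );

          $parser->parse($html);
          $parser->eof;

          print "$_\n" for @found;   # prints "first" and "second", skips "elsewhere"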

      As for LWP vs Mech, LWP does work for my use case; I just prefer Mech for a few niceties such as autocheck, the auto-delegation of $mech->content() to $response->decoded_content(), the cookie_jar defaulting to on, etc.
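
      For comparison, here's roughly what those niceties look like side by side (example.com is just a placeholder URL):

          use strict;
          use warnings;
          use WWW::Mechanize;
          use LWP::UserAgent;

          my $url = 'http://example.com/';

          # WWW::Mechanize: autocheck dies on HTTP errors, a cookie jar is on
          # by default, and content() hands back the decoded body.
          my $mech = WWW::Mechanize->new( autocheck => 1 );
          $mech->get($url);
          my $html_via_mech = $mech->content;

          # Roughly the same with plain LWP: ask for an in-memory cookie jar
          # explicitly, check the response yourself, and decode it yourself.
          my $ua  = LWP::UserAgent->new( cookie_jar => {} );
          my $res = $ua->get($url);
          die "GET $url failed: ", $res->status_line unless $res->is_success;
          my $html_via_lwp = $res->decoded_content;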

Re: The State of Web spidering in Perl
by Anonymous Monk on Sep 23, 2013 at 00:35 UTC
Re: The State of Web spidering in Perl ( WWW::WebKit controls Gtk3::WebKit )
by Anonymous Monk on Sep 28, 2013 at 23:05 UTC