http://www.perlmonks.org?node_id=1055183

Dear fellow monks,

I have been looking into the various web scraping frameworks in Perl, and gathered the following ones bit by bit from various Perlmonks discussions and blog posts. I'm listing them here for posterity, but my main purpose is to get the community's feedback on the current status of these and the best way to go about web spidering in modern Perl.

(Note that the comments are quick first impressions and may be wildly inaccurate; corrections welcome.)

Starting from:

The main reason I'm making this post is that I seem to be stumbling upon good scraping frameworks randomly, so it's quite possible I'm missing some really good framework that Google just hasn't deigned to show me. So, I'd like to get the opinion of revered monks on this topic.

Re: The State of Web spidering in Perl
by brianski (Novice) on Sep 22, 2013 at 15:43 UTC
    Speaking of good and old, what about LWP instead of WWW::Mechanize, and HTML::Parser instead of HTML::TreeBuilder? I have a bunch of production code that uses them, and it's been working flawlessly (modulo the bugs we introduce ;) for over a decade...

      I'll give HTML::Parser a second look, thanks for the suggestion. How do you match something like //div[@id='blah']/p though? Do you explicitly maintain state?

      As for LWP vs Mech, LWP does work for my use case, I just prefer Mech for a few niceties like autocheck, auto-delegation of $mech->content() to $response->decoded_content(), cookie_jar defaulting to on, etc.
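
On the question of matching something like //div[@id='blah']/p with HTML::Parser: one way to do it is indeed to maintain state by hand, keeping a stack of open elements in the event handlers. What follows is only a minimal sketch of that idea, not anyone's production code. The helper name paragraphs_in_div and the sample markup are made up for illustration, and the sketch assumes reasonably well-formed HTML (it does not model implicit end tags, so `<p>a<p>b` would be treated as nested).

```perl
use strict;
use warnings;
use HTML::Parser ();

# Elements that never have an end tag, so they must not go on the stack.
my %VOID = map { $_ => 1 }
    qw(area base br col embed hr img input link meta source track wbr);

# Hypothetical helper: return the text of each <p> that is a direct child
# of <div id="$wanted_id">, roughly what //div[@id='...']/p would select.
sub paragraphs_in_div {
    my ($html, $wanted_id) = @_;

    my @stack;      # open elements, innermost last: { tag => ..., id => ... }
    my @found;      # collected paragraph texts
    my $buffer;     # defined only while inside a matching <p>
    my $p_index;    # stack index of the <p> currently being captured

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;
            return if $VOID{$tag};
            my $parent = $stack[-1];
            if (   !defined $buffer
                && $tag eq 'p'
                && $parent
                && $parent->{tag} eq 'div'
                && defined $parent->{id}
                && $parent->{id} eq $wanted_id )
            {
                $buffer  = '';
                $p_index = scalar @stack;
            }
            push @stack, { tag => $tag, id => $attr->{id} };
        }, 'tagname, attr' ],
        text_h => [ sub {
            my ($text) = @_;
            $buffer .= $text if defined $buffer;
        }, 'dtext' ],
        end_h => [ sub {
            my ($tag) = @_;
            # Find the innermost open element with this name; ignore stray end tags.
            my ($i) = grep { $stack[$_]{tag} eq $tag } reverse 0 .. $#stack;
            return unless defined $i;
            splice @stack, $i;    # drops it and any unclosed children
            if ( defined $buffer && @stack <= $p_index ) {
                push @found, $buffer;
                undef $buffer;
            }
        }, 'tagname' ],
    );

    $parser->parse($html);
    $parser->eof;
    return @found;
}

# Made-up sample markup, for illustration only.
my $html = '<div id="blah"><p>one</p><span><p>nested</p></span></div>';
print "$_\n" for paragraphs_in_div($html, 'blah');
```

The stack is what stands in for the XPath child axis here: a `<p>` is captured only when the element on top of the stack is the wanted `<div>`, so paragraphs nested deeper (inside the `<span>` above) are skipped. For anything more involved than a couple of fixed paths, a tree-based module with real XPath support is probably less work than growing this by hand.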

Re: The State of Web spidering in Perl
by Anonymous Monk on Sep 23, 2013 at 00:35 UTC
Re: The State of Web spidering in Perl ( WWW::WebKit controls Gtk3::WebKit )
by Anonymous Monk on Sep 28, 2013 at 23:05 UTC
Re: The State of Web spidering in Perl
by Corion (Patriarch) on Sep 16, 2015 at 07:52 UTC

    So far, I haven't found a good generic approach to web scraping. The approach I usually use has condensed into App::scrape, which, while useful in itself, is more the general toolkit I use:

    It uses

    Of course, for the more complex tasks, LWP::Simple has to be replaced by WWW::Mechanize or something that speaks JavaScript, but when I need more complex navigation, I have found no library or framework that makes this easier than writing code against the WWW::Mechanize API directly.

Re: The State of Web spidering in Perl ( WWW::Mechanize::PhantomJS )
by Anonymous Monk on Sep 16, 2015 at 00:07 UTC