The State of Web spidering in Perl

by digital_carver (Sexton)
on Sep 22, 2013 at 13:48 UTC

Dear fellow monks,

I have been looking into the various web scraping frameworks in Perl, and have gathered the following list bit by bit from various PerlMonks discussions and blog posts. I'm listing them here for posterity, but my main purpose is to get the community's feedback on the current status of these frameworks and on the best way to go about web spidering in modern Perl.

(Note that the comments are quick first impressions and may be wildly inaccurate; corrections welcome.)

Starting with:

  • Good old WWW::Mechanize and HTML::TreeBuilder (Mojo::UserAgent and Mojo::DOM seem to be roughly equivalent, but I haven't tried them).
    • Comments: Gets the job done and gives you full control (edit the HTML before parsing if you want, get HTML dumps easily, etc.), but the code ends up quite verbose and boilerplate-heavy (see the first sketch after this list).
  • Scrappy
  • Gungho
  • YADA - just came across it, haven't used it yet.
  • Web::Scraper
    • Comments: This is what I'm using now. The DSL syntax is nice, though a bit under-documented; I had to peek into the sources quite a bit to understand or customize many things. (A second sketch after this list shows the DSL.)
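
To make the comparison concrete, here is a minimal sketch of the Mech + TreeBuilder approach from the first item. The URL and the div id ('blah') are placeholders, not taken from any real site:

    use strict;
    use warnings;
    use WWW::Mechanize;
    use HTML::TreeBuilder;

    # autocheck => 1 makes Mech die on HTTP errors;
    # a cookie jar is enabled by default
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get('http://example.com/');    # placeholder URL

    # $mech->content delegates to $response->decoded_content
    my $tree = HTML::TreeBuilder->new_from_content( $mech->content );

    # look_down is TreeBuilder's element matcher; 'blah' is a made-up id
    for my $div ( $tree->look_down( _tag => 'div', id => 'blah' ) ) {
        print $_->as_text, "\n" for $div->look_down( _tag => 'p' );
    }

    $tree->delete;    # TreeBuilder trees must be freed explicitly

Even this tiny example shows the verbosity complained about above: fetch, parse, walk, and clean up are all spelled out by hand.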
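
For contrast, the same placeholder extraction in Web::Scraper's DSL (URL, selector, and field name are again illustrative):

    use strict;
    use warnings;
    use URI;
    use Web::Scraper;

    # 'div#blah p' and 'paragraphs' are illustrative names
    my $s = scraper {
        process 'div#blah p', 'paragraphs[]' => 'TEXT';
    };

    # passing a URI lets Web::Scraper do the fetching itself
    my $res = $s->scrape( URI->new('http://example.com/') );
    print "$_\n" for @{ $res->{paragraphs} || [] };

The 'key[]' convention collects every match into an array reference, which is one of the DSL niceties mentioned above.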
The main reason I'm making this post is that I seem to stumble upon good scraping frameworks at random, so it's quite possible I'm missing a really good framework that Google just hasn't deigned to show me. So, I'd like to get the opinion of the revered monks on this topic.

Replies are listed 'Best First'.
Re: The State of Web spidering in Perl
by brianski (Novice) on Sep 22, 2013 at 15:43 UTC
    Speaking of good and old, what about LWP instead of WWW::Mechanize, and HTML::Parser instead of HTML::TreeBuilder? I have a bunch of production code that uses them, and it's been working flawlessly (modulo the bugs we introduce ;) for over a decade...

      I'll give HTML::Parser a second look, thanks for the suggestion. How do you match something like //div[@id='blah']/p, though? Do you explicitly maintain state?

      As for LWP vs. Mech: LWP does work for my use case; I just prefer Mech for a few niceties like autocheck, the auto-delegation of $mech->content() to $response->decoded_content(), the cookie_jar defaulting to on, etc.
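
      For illustration, here is a minimal HTML::Parser sketch of the explicit-state style the question above implies; this is not brianski's actual code, and 'blah' is the placeholder id from the question. Note that it matches descendant <p>s rather than only direct children, which already hints at the extra bookkeeping exact XPath semantics would need:

        use strict;
        use warnings;
        use HTML::Parser;

        my $html = '<div id="blah"><p>first</p><div><p>nested</p></div></div>';

        my $div_depth = 0;    # nesting depth inside the target div
        my $in_p      = 0;    # inside a <p> within that div?
        my @texts;

        my $parser = HTML::Parser->new(
            api_version => 3,
            start_h => [ sub {
                my ( $tag, $attr ) = @_;
                if ( $tag eq 'div' ) {
                    # count inner <div>s so their </div> doesn't end the match
                    $div_depth++ if $div_depth or ( $attr->{id} // '' ) eq 'blah';
                }
                elsif ( $tag eq 'p' and $div_depth ) {
                    $in_p = 1;
                    push @texts, '';
                }
            }, 'tagname, attr' ],
            end_h => [ sub {
                my ($tag) = @_;
                if    ( $tag eq 'div' and $div_depth ) { $div_depth-- }
                elsif ( $tag eq 'p' )                  { $in_p = 0 }
            }, 'tagname' ],
            text_h => [ sub { $texts[-1] .= $_[0] if $in_p }, 'dtext' ],
        );
        $parser->parse($html);
        $parser->eof;

        print "$_\n" for @texts;    # prints "first" and "nested"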

Re: The State of Web spidering in Perl
by Anonymous Monk on Sep 23, 2013 at 00:35 UTC
Re: The State of Web spidering in Perl ( WWW::WebKit controls Gtk3::WebKit )
by Anonymous Monk on Sep 28, 2013 at 23:05 UTC
Re: The State of Web spidering in Perl
by Corion (Patriarch) on Sep 16, 2015 at 07:52 UTC

    So far, I haven't found a good generic approach to web scraping. The approach I usually use has condensed into App::scrape, which, while useful in itself, is more a distillation of the general toolkit I use.

    It uses LWP::Simple to fetch pages. Of course, for the more complex tasks, LWP::Simple has to be replaced by WWW::Mechanize or by something that speaks JavaScript. But when I need more complex navigation, I have found no library or framework approach that makes things easier than writing code against the WWW::Mechanize API directly.
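
    As a rough sketch of that fetch-then-extract pattern (this does not reproduce App::scrape's internals, and HTML::TreeBuilder::XPath is an assumption for the parsing side; the URL and XPath are placeholders):

        use strict;
        use warnings;
        use LWP::Simple qw(get);
        use HTML::TreeBuilder::XPath;    # assumed parser, not named in the reply

        # placeholder URL and XPath expression
        my $html = get('http://example.com/')
            or die "Couldn't fetch the page\n";

        my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

        # findvalues returns the text content of each matching node
        print "$_\n" for $tree->findvalues('//div[@id="blah"]/p');

        $tree->delete;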

Re: The State of Web spidering in Perl ( WWW::Mechanize::PhantomJS )
by Anonymous Monk on Sep 16, 2015 at 00:07 UTC
