Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??

Dear fellow monks,

I have been looking into the various web scraping frameworks in Perl, and gathered the following ones bit by bit from various Perlmonks discussions and blog posts. I'm listing them here for posterity, but my main purpose is to get the community's feedback on the current status of these and the best way to go about web spidering in modern Perl.

(Note that the comments are quick first impressions and maybe wildly inaccurate, corrections welcome)

Starting from:

  • Good old WWW::Mechanize and HTML::TreeBuilder (Mojo::UserAgent and Mojo::DOM seem to be basically equivalent, I haven't tried them).
    • Comments: Gets the job done, gives you full control (edit the HTML before parsing if you want, get HTML dumps easily, etc.), but the code ends up quite verbose and boilerplated.
  • Scrappy
  • Gungho
  • YADA - just came across it, haven't used it yet.
  • Web::Scraper
    • Comments: This is what I'm using now, the DSL syntax is nice though a bit under-documented, and I had to peek into the sources quite a bit to either understand or customize many things.
The main reason I'm making this post is that I seem to be stumbling upon good scraping frameworks randomly, so it's quite possible I'm missing some really good framework that Google just hasn't divined to show me. So, I'd like to get the opinion of revered monks on this topic.

In reply to The State of Web spidering in Perl by digital_carver

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others scrutinizing the Monastery: (7)
    As of 2014-08-23 16:50 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      The best computer themed movie is:











      Results (175 votes), past polls