The State of Web spidering in Perl

Dear fellow monks,

I have been looking into the various web scraping frameworks in Perl, and gathered the following ones bit by bit from various PerlMonks discussions and blog posts. I'm listing them here for posterity, but my main purpose is to get the community's feedback on the current status of these and the best way to go about web spidering in modern Perl.

(Note that the comments are quick first impressions and may be wildly inaccurate; corrections welcome.)

Starting from:

Good old WWW::Mechanize and HTML::TreeBuilder (Mojo::UserAgent and Mojo::DOM seem to be basically equivalent, I haven't tried them).
- Comments: Gets the job done, gives you full control (edit the HTML before parsing if you want, get HTML dumps easily, etc.), but the code ends up quite verbose and full of boilerplate.
Scrappy
- Comments: Looks interesting, but the docs are a bit scattered and felt confusing, and the development seems to have stagnated.
Gungho
- Comments: Looks perfect, with async IO, automatic robots.txt handling and actual built-in logging, but unfortunately development seems to have stopped here too.
YADA - just came across it, haven't used it yet.
Web::Scraper
- Comments: This is what I'm using now, the DSL syntax is nice though a bit under-documented, and I had to peek into the sources quite a bit to either understand or customize many things.
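
For readers who have not seen the Web::Scraper DSL mentioned above, here is a minimal sketch. The HTML snippet, the `div#blah` selector and the `paragraphs` key are made up for illustration; the input is an inline string so nothing is fetched over the network.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical sample page, inlined so the sketch needs no network access.
my $html = <<'HTML';
<html><body>
<div id="blah"><p>first</p><p>second</p></div>
<div id="other"><p>ignored</p></div>
</body></html>
HTML

# Declare what to extract: the text of every <p> that is a direct
# child of <div id="blah">, collected into an array under 'paragraphs'.
my $scraper = scraper {
    process 'div#blah > p', 'paragraphs[]' => 'TEXT';
};

# scrape() accepts a string of HTML (it also accepts a URI object).
my $result = $scraper->scrape($html);

print "$_\n" for @{ $result->{paragraphs} || [] };
```

Passing `URI->new($url)` to `scrape` instead of a string would make Web::Scraper fetch the page itself.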

The main reason I'm making this post is that I seem to be stumbling upon good scraping frameworks randomly, so it's quite possible I'm missing some really good framework that Google just hasn't deigned to show me. So, I'd like to get the opinion of revered monks on this topic.

Re: The State of Web spidering in Perl by brianski (Novice) on Sep 22, 2013 at 15:43 UTC
Speaking of good and old, what about LWP instead of WWW::Mechanize, and HTML::Parser instead of HTML::TreeBuilder? I have a bunch of production code that uses it, and it's been working flawlessly (modulo the bugs we introduce ;) for over a decade...
Re^2: The State of Web spidering in Perl by digital_carver (Sexton) on Sep 22, 2013 at 16:49 UTC
I'll give HTML::Parser a second look, thanks for the suggestion. How do you match something like `//div[@id='blah']/p` though, do you explicitly maintain state? As for LWP vs Mech, LWP does work for my use case, I just prefer Mech for a few niceties like `autocheck`, auto-delegation of `$mech->content()` to `$response->decoded_content()`, `cookie_jar` defaulting to on, etc.
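
To make the "explicitly maintain state" question concrete, here is a rough sketch of what matching `//div[@id='blah']/p` could look like with HTML::Parser's event handlers. It is only an illustration, not anyone's production code: the sample HTML and all variable names are invented, and it assumes well-formed input in which every element is explicitly closed (no bare `<br>`, no unclosed `<p>`).

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample input; every tag is explicitly closed.
my $html = <<'HTML';
<html><body>
<div id="blah"><p>first</p><p>second <b>bold</b></p></div>
<div id="other"><p>ignored</p></div>
</body></html>
HTML

my @stack;          # names of the currently open elements
my $target_depth;   # stack depth of the open <div id="blah">, if any
my $in_p = 0;       # true while inside a <p> directly under that div
my @found;          # collected paragraph texts

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $attr) = @_;
        push @stack, $tag;
        if (!defined $target_depth
            && $tag eq 'div' && ($attr->{id} // '') eq 'blah') {
            $target_depth = @stack;          # remember where the div sits
        }
        elsif (defined $target_depth
            && $tag eq 'p' && @stack == $target_depth + 1) {
            $in_p = 1;                       # direct child <p> opened
            push @found, '';
        }
    }, 'tagname, attr' ],
    end_h => [ sub {
        my ($tag) = @_;
        # Leaving the direct-child <p>
        $in_p = 0 if $in_p && $tag eq 'p' && @stack == $target_depth + 1;
        # Leaving the target div itself
        undef $target_depth
            if defined $target_depth && @stack == $target_depth;
        pop @stack;
    }, 'tagname' ],
    # Accumulate decoded text, including text of nested inline elements.
    text_h => [ sub { $found[-1] .= $_[0] if $in_p }, 'dtext' ],
);

$parser->parse($html);
$parser->eof;

print "$_\n" for @found;
```

All of the bookkeeping here (the element stack, the depth check, the flag) is what a tree builder with XPath support would otherwise do for you.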
Re^3: The State of Web spidering in Perl by Anonymous Monk on Sep 23, 2013 at 00:03 UTC
"I'll give HTML::Parser a second look, thanks for the suggestion. How do you match something like //div[@id='blah']/p though, do you explicitly maintain state?"

You don't. You might use HTML::Parser if you want to reinvent HTML::Tree. It's like XML::Parser: you might use it if you want to reinvent XML::Twig, but since both Tree and Twig exist and already do a fantastic job, don't waste your time reinventing them :)

And now my linkdump of examples, docs and tutorials. Because XML::Parser is low level, you should parse HTML/XML with XPath/Twig/DOM:
- Re: How to grab a portion of file with regex (don't)
- HTML Parser suggestions

See also the real discouragement, Oh Yes You Can Use Regexes to Parse HTML!, and the real encouragement, Re^2: parsing XML fragments (xml log files) with... a regex
- How do I match XML, HTML, or other nasty, ugly things with a regex?
- How do I remove HTML from a string?
- Re: Parsing webpages

See also htmltreexpather.pl and xpather.pl.

htmltreexpather.pl:
- Parsing HTML / Re^4: Parsing HTML
- A regex question
- NASA's Astronomy Picture of the Day / Re: NASA's Astronomy Picture of the Day
- Re: Extracting HTML content between the h tags
- Re^2: Help With Online Table Scraper
- Re^4: web::scraper using an xpath
- HTML Parser suggestions

xpather.pl:
- Re: Get Node Value from irregular XML (xpather.pl)
- Re: Having trouble with siblings
- Re^2: XML parsing and Lists
- Re: Counting number of child nodes based on element value (typos)
- Re^3: Extracting specific childnodes (xpath whitespace)
- Re^3: Extracting specific childnodes (play xmllint --shell)
- Re: How do i get value of an element if the next elememnt has specific value in XML::LibXML using Xpath?
- Re: How to parse xml with namespase vale in XMl:LibXML? (XPath error : Undefined namespace prefix)
- Re^2: How to parse xml with namespase vale in XMl:LibXML? (xmllint --shell setns / xpathtester)

There is a better way :)
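
As a small illustration of the tree-plus-XPath route recommended here, the expression from the question can be run directly with HTML::TreeBuilder::XPath. This is a sketch under assumptions: the sample HTML is invented, and the choice of HTML::TreeBuilder::XPath (rather than, say, XML::LibXML) is just one option.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample input.
my $html = <<'HTML';
<html><body>
<div id="blah"><p>first</p><p>second</p></div>
<div id="other"><p>ignored</p></div>
</body></html>
HTML

# Build the tree once; no hand-written state tracking needed.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# findvalues returns the text value of each matching node.
my @paragraphs = $tree->findvalues(q{//div[@id='blah']/p});

print "$_\n" for @paragraphs;

# HTML::Element trees hold circular references; free them explicitly.
$tree->delete;
```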
Re: The State of Web spidering in Perl by Anonymous Monk on Sep 23, 2013 at 00:35 UTC
"Scrappy - Comments: Looks interesting, but the docs are a bit scattered and felt confusing, and the development seems to have stagnated."

Scrappy is too much pee, as I say in Re: Scrappy user_agent error.

"... spidering ... scraping ..."

You're confusing yourself there a little: spidering (anything goes) is a completely different ballgame than scraping (this one particular site).

- WWW::Mechanize makes LWP bearable for scraping
- WWW::Mechanize::Firefox adds JS support

That's it for scraping: the bare essentials and the state of the art. All the others add a little sugar and some baggage:

- WWW::Scripter adds "JS support" (WWW::Scripter::Plugin::JavaScript / WWW::Scripter::Plugin::Ajax)... very impressive/neat, but it ain't no WWW::Mechanize::Firefox
- Web::Scraper adds a little sugar, like
- Web::Magic adds maximum magic with maximum dependencies
- Web::Query adds maximum sugar with minimal dependencies, like jQuery, but has odd bits
- Mojo is slick but Mojo::DOM is still barren (pita)
- XML::LibXML: has work to do, it can load HTML/XML documents from URLs but the headers/post/... HTTP stuff isn't there yet
- Mozilla::Mechanize: good luck building that :) it ain't easy, no it ain't easy
- Gtk3::WebKit: it's a browser, you might scrape with it somehow (probably not)
- Gtk2::WebKit: it's a browser, you might scrape with it somehow (probably not)
- Wx::HtmlWindow: it's an (old/weak/limited) browser, you can scrape with it somehow, it's clumsy and limited, not good for scraping
- Wx::WebView: it's a (new/moderner/css+js) browser, even more useless for scraping than Wx::HtmlWindow ... looks nice but like all these browsers, not designed for scraping, although it could be
Re: The State of Web spidering in Perl ( WWW::WebKit controls Gtk3::WebKit ) by Anonymous Monk on Sep 28, 2013 at 23:05 UTC
Discovered via http://blogs.perl.org/users/robhammond/2013/02/web-scraping-with-perl-phantomjs.html

- WWW::WebKit - Perl extension for controlling an embedding WebKit engine (Gtk3::WebKit)
- Wight - drive PhantomJS: Headless WebKit with JavaScript API
Re: The State of Web spidering in Perl by Corion (Patriarch) on Sep 16, 2015 at 07:52 UTC
So far, I haven't found a good generic approach to web scraping. The approach I usually use has condensed into App::scrape, which, while useful in itself, is more the general toolkit I use. It uses:

- HTML::TreeBuilder::XPath for parsing the HTML and querying the tree as XPath expressions
- HTML::Selector::XPath for querying the tree using CSS selectors
- LWP::Simple for page fetching

Of course, for the more complex tasks, LWP::Simple has to be replaced by WWW::Mechanize or something speaking JavaScript, but if I need more complex navigation, I found no approach to a library or framework that makes this easier other than writing code for the WWW::Mechanize API directly.
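
A rough sketch of how the pieces of that toolkit fit together, for readers unfamiliar with the modules. This is not App::scrape's actual code: the sample HTML and selector are invented, and the page is an inline string where a real script would fetch it (for example with `get($url)` from LWP::Simple).

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

# Hypothetical page content; in practice this would come from a fetch.
my $html = <<'HTML';
<html><body>
<div id="blah"><p>first</p><p>second</p></div>
<div id="other"><p>ignored</p></div>
</body></html>
HTML

# Translate a CSS selector into an XPath expression.
my $xpath = selector_to_xpath('div#blah > p');

# Parse the HTML into a tree that understands XPath queries.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Query the tree and pull the text out of each matching element.
my @texts = map { $_->as_text } $tree->findnodes($xpath);

print "selector translated to: $xpath\n";
print "$_\n" for @texts;

$tree->delete;
```

Writing queries as CSS selectors and letting HTML::Selector::XPath translate them keeps the scraping rules short, while the tree itself is still queried through a single XPath interface.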
Re: The State of Web spidering in Perl ( WWW::Mechanize::PhantomJS ) by Anonymous Monk on Sep 16, 2015 at 00:07 UTC
Since 2014, from the Corion :D

WWW::Mechanize::PhantomJS - automate the PhantomJS browser

http://phantomjs.org/download.html provides binaries, but the Windows ones require higher than WinXP :/
