Dear fellow monks,
I have been looking into the various web scraping frameworks in Perl, and gathered the following bit by bit from various PerlMonks discussions and blog posts. I'm listing them here for posterity, but my main purpose is to get the community's feedback on their current status and on the best way to go about web spidering in modern Perl.
(Note that the comments are quick first impressions and may be wildly inaccurate; corrections welcome.)
Starting from:
- Good old WWW::Mechanize and HTML::TreeBuilder (Mojo::UserAgent and Mojo::DOM seem to be basically equivalent, though I haven't tried them).
- Comments: Gets the job done and gives you full control (edit the HTML before parsing if you want, get HTML dumps easily, etc.), but the code ends up quite verbose and full of boilerplate.
- Scrappy
- Comments: Looks interesting, but the docs are a bit scattered and felt confusing, and the development seems to have stagnated.
- Gungho
- Comments: Looks perfect, with async IO, automatic robots.txt handling and actual built-in logging, but unfortunately development seems to have stopped here too.
- YADA - just came across it, haven't used it yet.
- Web::Scraper
- Comments: This is what I'm using now. The DSL syntax is nice, though a bit under-documented, and I had to peek into the sources quite a bit to understand or customize many things.
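For readers who haven't seen the Web::Scraper DSL mentioned above, here is a minimal sketch of what it looks like. The HTML snippet, the `li.module` selector and the key names `title` and `modules` are made up for illustration; the HTML is passed in as a string so that nothing is fetched over the network.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical HTML standing in for a fetched page.
my $html = <<'HTML';
<html><head><title>Example list</title></head>
<body>
<ul>
  <li class="module">WWW::Mechanize</li>
  <li class="module">Web::Scraper</li>
</ul>
</body></html>
HTML

# Each "process" line maps a CSS selector to a key in the result hash.
# A trailing [] on the key collects every match into an array reference.
my $scraper = scraper {
    process 'title',     'title'     => 'TEXT';
    process 'li.module', 'modules[]' => 'TEXT';
};

# scrape() accepts an HTML string (as here) as well as a URI object.
my $result = $scraper->scrape($html);

print "Title: $result->{title}\n";
print "Module: $_\n" for @{ $result->{modules} || [] };
```

The same extraction with HTML::TreeBuilder would need explicit tree construction and look_down calls, which is roughly the verbosity trade-off described in the list above.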
Replies are listed 'Best First'.
Re: The State of Web spidering in Perl
by brianski (Novice) on Sep 22, 2013 at 15:43 UTC
by digital_carver (Sexton) on Sep 22, 2013 at 16:49 UTC
by Anonymous Monk on Sep 23, 2013 at 00:03 UTC
Re: The State of Web spidering in Perl
by Anonymous Monk on Sep 23, 2013 at 00:35 UTC
Re: The State of Web spidering in Perl ( WWW::WebKit controls Gtk3::WebKit )
by Anonymous Monk on Sep 28, 2013 at 23:05 UTC
Re: The State of Web spidering in Perl
by Corion (Patriarch) on Sep 16, 2015 at 07:52 UTC
Re: The State of Web spidering in Perl ( WWW::Mechanize::PhantomJS )
by Anonymous Monk on Sep 16, 2015 at 00:07 UTC