PerlMonks  

Re: The State of Web spidering in Perl

by Anonymous Monk
on Sep 23, 2013 at 00:35 UTC (#1055208)


in reply to The State of Web spidering in Perl

Scrappy Comments: Looks interesting, but the docs are a bit scattered and felt confusing, and the development seems to have stagnated.

Scrappy is too much pee, as I said in Re: Scrappy user_agent error

... spidering ... scraping ...

You're confusing yourself a little there: spidering (anything goes) is a completely different ballgame from scraping (this one particular site).

WWW::Mechanize makes LWP bearable for scraping
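A minimal sketch of what "bearable" looks like, assuming nothing more than a page with a title and a couple of links. The sub name page_links and the throwaway local file are made up for illustration (the file stands in for "this one particular site" so the snippet runs offline); only get, title, links and url_abs are WWW::Mechanize calls.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use URI::file;
use WWW::Mechanize;

# Hypothetical helper: fetch one page, return its title and absolute link URLs.
sub page_links {
    my ($url) = @_;
    my $mech = WWW::Mechanize->new( autocheck => 1 );    # die on fetch errors
    $mech->get($url);
    return ( $mech->title, map { $_->url_abs->as_string } $mech->links );
}

# Stand-in for a real site: a temporary local HTML file (assumption for the demo).
my ( $fh, $path ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
print {$fh} <<'HTML';
<html><head><title>Demo</title></head>
<body><a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a></body></html>
HTML
close $fh;

my ( $title, @links ) = page_links( URI::file->new_abs($path)->as_string );
print "$title\n";
print "$_\n" for @links;
```

Point page_links at an http:// URL and the same code scrapes a live page.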

WWW::Mechanize::Firefox adds JS support

That's it for scraping: the bare essentials and the state of the art. All the others add a little sugar and some baggage.

WWW::Scripter adds "JS support" via WWW::Scripter::Plugin::JavaScript and WWW::Scripter::Plugin::Ajax ... very impressive/neat, but it ain't no WWW::Mechanize::Firefox

Web::Scraper adds a little sugar, like
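For a flavor of that sugar, here is a small sketch of my own (an illustration, not necessarily the snippet this post had in mind). It scrapes an inline HTML string, which is an assumption made so it runs offline; the keys title and links are arbitrary names.

```perl
use strict;
use warnings;
use Web::Scraper;

# Inline sample page (assumption for the demo; scrape() also accepts a URI object).
my $html = <<'HTML';
<html><head><title>Demo</title></head>
<body><a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a></body></html>
HTML

# Declare what to pull out: CSS selector => result key => what to extract.
my $scraper = scraper {
    process 'title', 'title'   => 'TEXT';     # text content of <title>
    process 'a',     'links[]' => '@href';    # href of every <a>, collected in an array
};

my $res = $scraper->scrape($html);
print "$res->{title}\n";
print "$_\n" for @{ $res->{links} };
```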

Web::Magic adds maximum magic with maximum dependencies

Web::Query adds maximum sugar with minimal dependencies, like jQuery, but has odd bits

-Mojo is slick but Mojo::DOM is still barren (pita)
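To judge Mojo::DOM for yourself, a minimal sketch using only CSS selectors on an inline HTML string (the sample markup is made up for the demo):

```perl
use strict;
use warnings;
use Mojo::DOM;

# Inline sample page (assumption for the demo).
my $html = <<'HTML';
<html><head><title>Demo</title></head>
<body><a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a></body></html>
HTML

my $dom   = Mojo::DOM->new($html);
my $title = $dom->at('title')->text;                               # first match
my @links = $dom->find('a[href]')->map( attr => 'href' )->each;    # all matches

print "$title\n";
print "$_\n" for @links;
```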

XML::LibXML -- has work to do: it can load HTML/XML documents from URLs, but the headers/POST/... HTTP stuff isn't there yet
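The parsing side, at least, is straightforward. A minimal sketch with XPath on an inline string (sample markup made up for the demo); load_html also takes location => $url in place of string => ..., which is the "load from URLs" part, with no control over headers or POST data.

```perl
use strict;
use warnings;
use XML::LibXML;

# Inline sample page (assumption for the demo).
my $html = <<'HTML';
<html><head><title>Demo</title></head>
<body><a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a></body></html>
HTML

# recover => 2 tolerates broken real-world HTML without printing warnings.
my $doc   = XML::LibXML->load_html( string => $html, recover => 2 );
my $title = $doc->findvalue('//title');
my @links = map { $_->value } $doc->findnodes('//a/@href');

print "$title\n";
print "$_\n" for @links;
```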

 


Mozilla::Mechanize -- good luck building that :) it ain't easy, no it ain't easy

Gtk3::WebKit -- it's a browser; you might scrape with it somehow (probably not)

Gtk2::WebKit -- it's a browser; you might scrape with it somehow (probably not)

Wx::HtmlWindow -- it's an (old/weak/limited) browser; you can scrape with it somehow, but it's clumsy and limited, not good for scraping

Wx::WebView -- it's a (new/more modern/CSS+JS) browser, even more useless for scraping than Wx::HtmlWindow ... looks nice, but like all these browsers it's not designed for scraping, although it could be

