PerlMonks  

Re^4: Web Scraping on CGI Scripts?

by fraizerangus (Sexton)
on Oct 12, 2011 at 18:58 UTC


in reply to Re^3: Web Scraping on CGI Scripts?
in thread Web Scraping on CGI Scripts?

Hello Again

WWW::Mechanize does seem to be the right medicine, but I've already hit a snag. I'm only interested in following the 'motion.cgi' links and extracting the linked pages as text documents; however, the regex I've used only seems to find the first 2 links. Any ideas on what's going on?

#!/usr/bin/perl
use strict;
use WWW::Mechanize;
use Storable;

my $mech_cgi = WWW::Mechanize->new;
$mech_cgi->get( 'http://www.molmovdb.org/cgi-bin/browse.cgi' );
my @cgi_links = $mech_cgi->find_all_links( url_regex => qr/motion.cgi?/ );

for(my $i = 0; $i < @cgi_links; $i++) {
    print "following link: ", $cgi_links[$i]->url, "\n";
    $mech_cgi->follow_link( url => $cgi_links[$i]->url )
        or die "Error following link ", $cgi_links[$i]->url;
}

best wishes

Dan


Re^5: Web Scraping on CGI Scripts?
by tospo (Hermit) on Oct 13, 2011 at 08:56 UTC
    That's because after the first "follow_link" action, $mech_cgi is on a different page (it behaves like a browser). When you then issue the next follow_link command, that link doesn't actually exist on the page you are on now. Add "$mech_cgi->back" before the end of the loop and you will iterate through all the links.
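
    The suggestion above could be sketched roughly as follows. This is only an illustration, not the original poster's final script: the helper name visit_matching_links and its optional callback are hypothetical, and the escaped pattern motion\.cgi\? is an assumption about what the regex was meant to match. WWW::Mechanize is loaded only when a start URL is passed on the command line, so the helper works with any object that offers find_all_links, follow_link and back.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Follow every link on the current page whose URL matches $regex,
# going back to the listing page after each one so the next link
# is still present when follow_link looks for it.
# $mech is expected to behave like a WWW::Mechanize object.
# $on_page is an optional callback (hypothetical) run on each visited page.
sub visit_matching_links {
    my ( $mech, $regex, $on_page ) = @_;

    my @links = $mech->find_all_links( url_regex => $regex );

    for my $link (@links) {
        print "following link: ", $link->url, "\n";
        $mech->follow_link( url => $link->url )
            or die "Error following link ", $link->url;

        $on_page->($mech) if $on_page;

        $mech->back;    # return to the page that holds the links
    }

    return scalar @links;
}

# Only touch the network when a start URL is given, e.g.
#   perl script.pl http://www.molmovdb.org/cgi-bin/browse.cgi
if (@ARGV) {
    require WWW::Mechanize;
    my $mech = WWW::Mechanize->new;
    $mech->get( $ARGV[0] );

    # Assumed pattern: a literal "motion.cgi?" somewhere in the URL.
    visit_matching_links( $mech, qr/motion\.cgi\?/,
        sub { print length( $_[0]->content ), " bytes fetched\n" } );
}
```

    The callback is where each page could be written out, for instance with $mech->save_content and a file name of your choosing. Fetching each link directly with $mech->get( $link->url_abs ) instead of follow_link would also avoid the need for back, at the cost of no longer behaving like a browser clicking through the page.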
