
Re^3: Web Scraping on CGI Scripts?

by tospo (Hermit)
on Oct 11, 2011 at 08:32 UTC

in reply to Re^2: Web Scraping on CGI Scripts?
in thread Web Scraping on CGI Scripts?

Oh, and I forgot to mention: you are always parsing the HTML output that the server sends you. It doesn't matter that a CGI script generates the page on the server; the output is just HTML (unless it's a web service that sends XML, JSON or the like). So there is nothing special about this case.
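To make the point concrete, here is a minimal sketch that looks only at what the server sent back and never at the ".cgi" in the URL. The helper name response_kind and the command-line URL are assumptions made for illustration; is_html and ct are WWW::Mechanize methods.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Classify the response by its content type. Whether a CGI script or a
# static file produced it makes no difference to the client.
# (response_kind is a hypothetical helper, not part of WWW::Mechanize.)
sub response_kind {
    my ($mech) = @_;
    return 'html' if $mech->is_html;
    my $type = $mech->ct // '';
    return 'json' if $type =~ m{\bjson\b}i;
    return 'xml'  if $type =~ m{\bxml\b}i;
    return 'other';
}

# Only fetch when a URL is given on the command line.
if (@ARGV) {
    require WWW::Mechanize;    # loaded here, only when actually fetching
    my $mech = WWW::Mechanize->new;
    $mech->get( $ARGV[0] );
    print "server sent: ", response_kind($mech), "\n";
}
```

If this reports html, the usual link and form methods apply; for JSON or XML you would hand the content to a matching parser instead.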

Re^4: Web Scraping on CGI Scripts?
by fraizerangus (Sexton) on Oct 12, 2011 at 18:58 UTC
    Hello Again

    WWW::Mechanize does seem to be the right medicine, but I've already hit a snag: I'm only interested in following the 'motion.cgi' links and extracting the linked pages as text documents; however, the regex I've used only seems to find the first 2 links. Any ideas on what's going on?

    #!/usr/bin/perl
    use strict;
    use WWW::Mechanize;
    use Storable;

    my $mech_cgi = WWW::Mechanize->new;
    $mech_cgi->get( '' );
    my @cgi_links = $mech_cgi->find_all_links( url_regex => qr/motion.cgi?/ );

    for(my $i = 0; $i < @cgi_links; $i++) {
        print "following link: ", $cgi_links[$i]->url, "\n";
        $mech_cgi->follow_link( url => $cgi_links[$i]->url )
            or die "Error following link ", $cgi_links[$i]->url;
    }

    best wishes


      That's because after the first "follow_link" call, $mech_cgi is on a different page (it behaves like a browser). When you then issue the next follow_link, that link doesn't exist on the page you are now on. Add "$mech_cgi->back" before the end of the loop and you will iterate through all the links.
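A minimal sketch of the loop with that fix applied. The sub name fetch_linked_pages, the command-line start URL and the stricter pattern qr/motion\.cgi\?/ are assumptions made for illustration; get, find_all_links, follow_link, content and back are WWW::Mechanize methods.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Follow every link on the start page whose URL matches $pattern and
# return a hash of link URL => page content.
# (fetch_linked_pages is a hypothetical helper name.)
sub fetch_linked_pages {
    my ( $mech, $start_url, $pattern ) = @_;

    $mech->get($start_url);
    my @links = $mech->find_all_links( url_regex => $pattern );

    my %pages;
    for my $link (@links) {
        print "following link: ", $link->url, "\n";
        $mech->follow_link( url => $link->url )
            or die "Error following link ", $link->url;
        $pages{ $link->url } = $mech->content;

        # Step back to the start page so the next link can be found there.
        $mech->back;
    }
    return %pages;
}

# Only crawl when a start URL is given on the command line.
if (@ARGV) {
    require WWW::Mechanize;    # loaded here, only when actually crawling
    my %pages = fetch_linked_pages( WWW::Mechanize->new, $ARGV[0],
        qr/motion\.cgi\?/ );
    print scalar( keys %pages ), " pages fetched\n";
}
```

Without the back call, the second follow_link runs against the page that the first link led to, which is why the loop stops early.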
