Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Re^3: Web Scraping on CGI Scripts?

by tospo (Hermit)
on Oct 11, 2011 at 08:32 UTC ( #930763=note: print w/ replies, xml ) Need Help??


in reply to Re^2: Web Scraping on CGI Scripts?
in thread Web Scraping on CGI Scripts?

oh and I forgot to mention: you are always parsing the HTML output that the server sends to you. It doesn't matter that this is a cgi script generating the page on the server, the output is just HTML (unless it's a webservice that sends XML, JSON or the like). So there is nothing special about this case.


Comment on Re^3: Web Scraping on CGI Scripts?
Re^4: Web Scraping on CGI Scripts?
by fraizerangus (Sexton) on Oct 12, 2011 at 18:58 UTC
    Hello Again

    WWW::Mechanize does seem to be the right medicine but I've already hit a snag on the road; I'm only interested in following the 'motion.cgi' links and extracting these links as text documents however the regex I've used only finds the first 2 links? Any ideas on whats going on?

    #!/usr/bin/perl use strict; use WWW::Mechanize; use Storable; my $mech_cgi = WWW::Mechanize->new; $mech_cgi->get( 'http://www.molmovdb.org/cgi-bin/browse.cgi' ); my @cgi_links = $mech_cgi->find_all_links( url_regex => qr/motion.cgi? +/ ); for(my $i = 0; $i < @cgi_links; $i++) { print "following link: ", $cgi_links[$i]->url, "\n"; $mech_cgi->follow_link( url => $cgi_links[$i]->url ) or die "Error following link ", $cgi_links[$i]->url; }
    best wishes

    Dan

      that's because after the first "follow_link" action, $mech_cgi is now on a different page (it behaves like a browser) and then you issue the next follow_link command but that links doesn't actually exist on the page you are on now. Add "$mech_cgi->back" before teh end of the loop and you will iterate through all the links.
Reaped: Re^4: Web Scraping on CGI Scripts?
by NodeReaper (Curate) on Oct 12, 2011 at 21:37 UTC

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://930763]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (7)
As of 2014-10-31 12:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    For retirement, I am banking on:










    Results (217 votes), past polls