PerlMonks  

Re: Web Scraping on CGI Scripts?

by tospo (Hermit)
on Oct 10, 2011 at 08:16 UTC (#930548)


in reply to Web Scraping on CGI Scripts?

If you are scraping a web page then it will be HTML. Or are you trying to parse output from a web service that sends a response in something like XML or JSON format? There are modules to handle these scenarios but it is important first to know what you are dealing with. Can you be more precise and maybe give a URL?


Re^2: Web Scraping on CGI Scripts?
by Anonymous Monk on Oct 10, 2011 at 16:29 UTC
    Hi Tospo. The URL is http://www.molmovdb.org/cgi-bin/browse.cgi. I'm trying to follow all the links to the database entries iteratively and output these as text files to analyse later. As you can probably see, the coding is not formatted amazingly well! Many thanks and best wishes, Dan
      That page - apart from being marked up in a rather old-fashioned way - isn't too bad at all. If you look at the page source, you can easily see a table structure that you can use to parse it.
      You will want to use a module like WWW::Mechanize to interact with the website. This module allows you to interact with web content like a user would in a browser. You can make your script "click" on links to get to the text files. Use the table structure of the "browse" page to iterate over all the molecules, each time following the link through to the text data files.
      Have a go with a simple example first. There are a few here. If you are getting stuck, post the script you have so far and what's happening so we can help you along. Good luck!
      Oh, and I forgot to mention: you are always parsing the HTML output that the server sends to you. It doesn't matter that a CGI script generates the page on the server; the output is just HTML (unless it's a web service that sends XML, JSON or the like). So there is nothing special about this case.
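As a starting point for the advice above, here is a minimal sketch of fetching a page with WWW::Mechanize and listing its links. The sub name list_links and the command-line usage are illustrative assumptions, not part of the module; the methods called (get, is_html, ct, links, text, url_abs) are from WWW::Mechanize and WWW::Mechanize::Link.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fetch a page and return [link text, absolute URL] pairs for every link on it.
# list_links is a hypothetical helper name chosen for this sketch.
sub list_links {
    my ($url) = @_;

    # Loaded here so the file still compiles where the module is not installed.
    require WWW::Mechanize;
    my $mech = WWW::Mechanize->new( autocheck => 1 );    # die on HTTP errors

    $mech->get($url);

    # Output from a CGI script is ordinary HTML by the time it reaches the client.
    die "Expected HTML, got ", $mech->ct, "\n" unless $mech->is_html;

    # Links without text (for example image links) get an empty string.
    return map { [ $_->text // '', $_->url_abs ] } $mech->links;
}

# Usage: perl list_links.pl http://www.molmovdb.org/cgi-bin/browse.cgi
if (@ARGV) {
    printf "%-40s %s\n", @$_ for list_links( $ARGV[0] );
}
```

Printing the links first makes it easier to see which URL pattern to filter on before writing any code that follows them.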
        Hello Again

        WWW::Mechanize does seem to be the right medicine, but I've already hit a snag on the road. I'm only interested in following the 'motion.cgi' links and saving the pages behind them as text documents. However, with the regex I've used, only the first 2 links are found. Any ideas on what's going on?

        #!/usr/bin/perl
        use strict;
        use WWW::Mechanize;
        use Storable;

        my $mech_cgi = WWW::Mechanize->new;
        $mech_cgi->get( 'http://www.molmovdb.org/cgi-bin/browse.cgi' );
        my @cgi_links = $mech_cgi->find_all_links( url_regex => qr/motion.cgi?/ );

        for(my $i = 0; $i < @cgi_links; $i++) {
            print "following link: ", $cgi_links[$i]->url, "\n";
            $mech_cgi->follow_link( url => $cgi_links[$i]->url )
                or die "Error following link ", $cgi_links[$i]->url;
        }

        best wishes

        Dan
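A likely explanation for the script above stopping early: follow_link searches the page that is currently loaded, and after the first hop that is no longer the browse page, so later links may simply not be found there. One way around this is to collect the absolute URLs once and then get each of them directly (calling $mech->back after each follow_link would be another). The sketch below also escapes the dot and question mark in the regex so they match literally. The sub names, the output file naming, and the command-line usage are assumptions made for illustration, and content( format => 'text' ) needs HTML::TreeBuilder to be installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape "." and "?" so they match literally. Unescaped, /motion.cgi?/ means
# "motion", any one character, "cg", then an optional "i".
my $MOTION_RE = qr/motion\.cgi\?/;

# Hypothetical helper: true if the URL looks like a motion.cgi query link.
sub is_motion_link {
    my ($url) = @_;
    return $url =~ $MOTION_RE ? 1 : 0;
}

# Hypothetical helper: save every motion.cgi page linked from $browse_url
# as plain text and return the number of pages written.
sub fetch_motion_pages {
    my ($browse_url) = @_;

    # Loaded here so the helper above works without the module installed.
    require WWW::Mechanize;
    my $mech = WWW::Mechanize->new( autocheck => 1 );    # die on HTTP errors
    $mech->get($browse_url);

    # Collect absolute URLs once, while the browse page is still loaded.
    my @urls = map { $_->url_abs->as_string }
               $mech->find_all_links( url_regex => $MOTION_RE );

    my $count = 0;
    for my $url (@urls) {
        print "fetching: $url\n";
        $mech->get($url);    # direct GET, independent of the current page

        # Assumed naming scheme: motion_0001.txt, motion_0002.txt, ...
        my $file = sprintf 'motion_%04d.txt', ++$count;
        open my $fh, '>:encoding(UTF-8)', $file
            or die "Cannot write $file: $!";
        print {$fh} $mech->content( format => 'text' );
        close $fh or die "Cannot close $file: $!";
    }
    return $count;
}

# Usage: perl fetch_motions.pl http://www.molmovdb.org/cgi-bin/browse.cgi
fetch_motion_pages( $ARGV[0] ) if @ARGV;
```

Fetching by absolute URL keeps each request independent of whatever page was loaded last, which is usually simpler than tracking the browser-style history.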
