PerlMonks  

Mechanize, Forms, Links, problem from Javascript?

by goodepic (Initiate)
on Jun 20, 2008 at 00:13 UTC
goodepic has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I'm trying to scrape a large website. Two single-select drop-down lists refresh the page and populate a third single-select drop-down list. After selecting from this third list, you click on one of 8 links below it. In the page HTML, each link's href is "#" and the tag carries a handler of the form onClick="tohtm('../.*.php')". In Firefox this opens a new page/tab showing a data table whose contents I need.

I'm using WWW::Mechanize for this. I can log in through the first page of the site and follow a link to get to the page described above. Then I tried setting the first two selects (after selecting the form they're in by name), but that doesn't seem to work: the responses I get still have an unpopulated third drop-down.

Luckily, changing those first two selects brings you to a different URL, so I've also tried just $browser->get()ting that URL and then selecting from the third drop-down and submitting/clicking. Then I tried following the link to the data with follow_link, but this just brings me back to the same page, with a "#" tacked onto the URL. I've also tried getting the URL for the data page directly after selecting from the third drop-down, but that gives me a page with an empty data table, one that isn't empty when accessed properly through the browser.

Below are some snippets from the HTML of the page I'm working with and the key lines from the code I'm trying to get to work.

<form name="sipp" method="post" target="_blank">
<input name="ses_id" type="hidden" value="sid">
...
<select name="fiscal" size="1" style="width:150" onChange="enableIt(this,document.sipp.propinsi); gatherInfothn(this, 'thn='); getval(this,document.sipp.thnang);">
<option value="0">Pilih Tahun</option>
<option value="2008" selected>2008</option>
<option value="2007">2007</option>
<option value="2006">2006</option>
<option value="2005">2005</option>
<option value="2004">2004</option>
</select>
<input type="hidden" name="thnang" value="">
... <select name="propinsi" style="width:150" onChange="enableIt(this,document.sipp.proyektemp); gatherInfoProp(document.sipp.fiscal, this, 'thn=','&kdprop='); getName(this,document.sipp.nmpropinsi);">
<option value="0">Pilih Propinsi</option>
<option value="01" selected>DKI Jakarta </option>
<option value="02">Jawa Barat </option>
<option value="03">Jawa Tengah </option>

And the links to the data table I need look like this. Note the dots in the HTML tags are just so this shows up looking right here:

<..tr>
<..td class="namaForm">Form A-3<../td>
<..td class="content" onMouseOver="this.bgColor='#EAEAEA'" onMouseOut="this.bgColor='#FFFFFF'">
<..a href="#" onClick="tohtm('../sipp2005/form_A3.php')">Laporan Paket Kontrak<../a>
<../td>
<..td align="center" class="clickableTXT"><img src="../images/xls.gif" alt="Simpan ke file Excel dan Print" width="20" height="20" onClick="toxls('../sipp2005/form_A3.php')"><../td>
<../tr>

Finally, here's some of my code. This comes after I've already logged in and followed a link to the page where the HTML above comes from.

$br->get('http://.../sipp.php?thn=2008&kdprop=01');
# $br is initialized from mechanize: my $br = WWW::Mechanize->new();
# Set ->agent_alias('Windows IE 6');
my $resp = $br->content();
$resp =~ s/\x0D//g; # On a mac here, get ^M at the end of each line
my @pt_vals = get_proyektemp_values($resp);
# Don't want to use mech-dump, so just regexing the newly populated values of the 3rd drop down menu
$br->form_name('sipp');
$br->field('proyektemp', "$pt_vals[1]");
$br->submit();

Then I've tried both of the following.
#1
my $link_resp = $br->follow_link(text_regex => qr/paket\s+kontrak/i);
#2
$br->get('.../sipp2005/form_A3.php');

#1 just brings me back to the same page I started on. #2 gets me to the data table page, but with an (incorrectly) empty table. Am I just being a newbie web programmer idiot? Is this a Javascript problem? Are these select controls and links all calling Javascript functions, which aren't interpreted in Mechanize? Are there other libraries that would scrape this page successfully? I've also tried the Python version of Mechanize, but had no success there either.

Thanks, Matt

Re: Mechanize, Forms, Links, problem from Javascript?
by Anonymous Monk on Jun 20, 2008 at 03:25 UTC
Re: Mechanize, Forms, Links, problem from Javascript?
by Cody Pendant (Prior) on Jun 20, 2008 at 03:27 UTC
    In a word, yes, it's a javascript issue.

    This stuff here: onChange="enableIt(this,document.sipp.proyektemp); gatherInfoProp(document.sipp.fiscal, this, 'thn=','&kdprop='); getName(this,document.sipp.nmpropinsi);" is JS which runs three different functions when something is selected in those menus.

    So, figure out what's actually happening, is the usual advice.

    My preferred way to do this is to use the Live HTTP Headers Add-on for Firefox. No matter what the Javascript does, sooner or later, the browser retrieves a page, from a URL, via HTTP, and once you can figure out what that URL is, you'll be able to write your script.
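One reason follow_link only lands on "#" is that the real target lives in the onClick attribute, not in href. As a minimal sketch (the helper name tohtm_targets is hypothetical, and the regex assumes the attribute is written exactly as in the snippet quoted in the question), the script paths can be pulled out of the page source like this:

```perl
use strict;
use warnings;

# Hypothetical helper: collect the paths passed to tohtm('...') in onClick
# attributes. Assumes double-quoted attributes and single-quoted arguments,
# as in the HTML quoted in the question.
sub tohtm_targets {
    my ($html) = @_;
    my @targets = $html =~ /onClick\s*=\s*"tohtm\('([^']+)'\)"/gi;
    return @targets;
}

my $html = q{<a href="#" onClick="tohtm('../sipp2005/form_A3.php')">Laporan Paket Kontrak</a>};
print "$_\n" for tohtm_targets($html);
```

This only tells you which script the link points at. What tohtm() actually sends to that script (GET or POST, and which fields) still has to be read from the recorded HTTP traffic.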



    Nobody says perl looks like line-noise any more
    kids today don't know what line-noise IS ...

      Thanks guys. HTTP::Recorder looks very cool and useful, but it doesn't deal with Javascript either. It just pumps out the code that I'd already tried.

      So I tried Live HTTP with Firefox. Oh man. I'm not super well versed in this stuff. I've just attached my Live HTTP output below. The only thing I've been able to think to try is to get the ../sipp2005/form_A3.php page "manually" by adding the ses_id line to the full form_A3 URL after a ?, both with the +'s and with them replaced by %20, since they appear as spaces in the page itself. That doesn't work even in my browser, already logged in to the site. Any help is GREATLY appreciated...

      There's stuff before this, but none of it, like the login, is Javascript dependent, so I can get there fine. I can get to the http://sipp.pu.go.id/sipp/sipp.php?thn=2008&kdprop=01 page just by requesting that URL after logging in (through Mechanize). The thn and kdprop values come from two drop-down select controls that use Javascript, so I can't use them properly, but just getting the URL works fine. There's statcounter.com HTTP content after the top one below, but I've deleted that. The second POST is what I need and can't get to work...

        Um, HTTP::Recorder is easier than Live HTTP Headers. You use a browser to do what you want (JS or no), and HTTP::Recorder records the conversation, which you then duplicate using Mechanize. Otherwise there really is no way without learning HTTP/CGI....
        Well, the output there says that there was a POST request to http://sipp.pu.go.id/sipp2005/form_A3.php with the content
        ses_id=sid&fiscal=2008&thnang=&propinsi=01&proyektemp=0905497004-110037040+-Ir.+Bambang+Erianto%2CMM++++++++&nmpinpro=-Ir.+Bambang+Erianto%2CMM++++++++&nippin=110037040&nmproyek=PUSAT-SEKRETARIAT-SNVT+PENANGANAN+MENDESAK+DAN+TANGGAP+DARURAT&proyek=0905497004&nmpropinsi=DKI+Jakarta

        Which might be the same, assuming the server accepts GET requests as well as POST requests, as this URL:

        http://sipp.pu.go.id/sipp2005/form_A3.php?ses_id=sid&fiscal=2008&thnang=&propinsi=01&proyektemp=0905497004-110037040+-Ir.+Bambang+Erianto%2CMM++++++++&nmpinpro=-Ir.+Bambang+Erianto%2CMM++++++++&nippin=110037040&nmproyek=PUSAT-SEKRETARIAT-SNVT+PENANGANAN+MENDESAK+DAN+TANGGAP+DARURAT&proyek=0905497004&nmpropinsi=DKI+Jakarta

        Now, I've gone to that URL and ... I can't read Bahasa Indonesia, so I don't know if that's the information you want or an error message.
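If the server turns out to insist on POST, the recorded request can be replayed from the same logged-in Mechanize object rather than rewritten as a GET. The following is only a sketch under assumptions: decode_form_body and post_form_a3 are hypothetical names, $br is taken to be the already logged-in WWW::Mechanize object from the question, and it is assumed the server needs nothing beyond the session cookies and the fields seen in the Live HTTP Headers capture. In the recorded body, "+" stands for a space and %2C for a comma, so the values are decoded first and then handed to post(), which re-encodes them.

```perl
use strict;
use warnings;

# Decode an application/x-www-form-urlencoded body into [name, value] pairs,
# keeping the original field order.
sub decode_form_body {
    my ($body) = @_;
    my @pairs;
    for my $chunk (split /&/, $body) {
        my ($name, $value) = split /=/, $chunk, 2;
        $value = '' unless defined $value;
        for ($name, $value) {
            tr/+/ /;                               # "+" encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # %XX encodes one byte
        }
        push @pairs, [$name, $value];
    }
    return @pairs;
}

# Hypothetical wrapper: replay the recorded POST through a logged-in
# WWW::Mechanize object. Not called here, since it needs a live session.
sub post_form_a3 {
    my ($br, $body) = @_;
    my @fields = map { @$_ } decode_form_body($body);
    return $br->post('http://sipp.pu.go.id/sipp2005/form_A3.php', Content => \@fields);
}

# Shortened sample of the recorded body, just to show the decoding step.
my $sample = 'ses_id=sid&fiscal=2008&thnang=&propinsi=01&nmpropinsi=DKI+Jakarta';
printf "%s = [%s]\n", @$_ for decode_form_body($sample);
```

With a live session, the call would be post_form_a3($br, $recorded_body), followed by reading $br->content(). Note that the page's form has target="_blank", which only matters to a real browser; Mechanize simply receives the response.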



        Nobody says perl looks like line-noise any more
        kids today don't know what line-noise IS ...
