Hi monks,

I'm trying to scrape a large website. Two single-select drop-down lists refresh the page and populate a third single-select drop-down list. After selecting from this third list, you click on one of 8 links below it. In the page HTML, each link's href is "#", and the tag carries onClick="tohtm('../.*.php')". In Firefox, clicking it opens a new page/tab with a data table whose contents I need.

I'm using WWW::Mechanize for this. I can log in through the first page of the site and follow a link to the page described above. I've then tried setting the first two selects (after selecting the form they're in by name), but that doesn't seem to work: the responses I get still have an unpopulated third drop-down.

Luckily, changing those first two selects takes you to a different URL, so I've also tried simply fetching that URL with $br->get() and then selecting and submitting/clicking from the third drop-down. From there I've tried following the link to the data with follow_link, but that just brings me back to the same page with a "#" tacked onto the URL. I've also tried getting the data page's URL directly after selecting from the third drop-down, but that gives me a page with an empty data table, whereas the table is populated when reached properly through the browser.

Below are some snippets from the HTML of the page I'm working with and the key lines from the code I'm trying to get to work.

<form name="sipp" method="post" target="_blank">
<input name="ses_id" type="hidden" value="sid">
<select name="fiscal" size="1" style="width:150" onChange="enableIt(this,document.sipp.propinsi); gatherInfothn(this, 'thn='); getval(this,document.sipp.thnang);">
<option value="0">Pilih Tahun</option>
<option value="2008" selected>2008</option>
<option value="2007">2007</option>
<option value="2006">2006</option>
<option value="2005">2005</option>
<option value="2004">2004</option>
<input type="hidden" name="thnang" value="">
... <select name="propinsi" style="width:150" onChange="enableIt(this,document.sipp.proyektemp); gatherInfoProp(document.sipp.fiscal, this, 'thn=','&kdprop='); getName(this,document.sipp.nmpropinsi);">
<option value="0">Pilih Propinsi</option>
<option value="01" selected>DKI Jakarta </option>
<option value="02">Jawa Barat </option>
<option value="03">Jawa Tengah </option>

The links to the data table I need look like this (the dots inside the HTML tags are only there so the markup displays properly here):

<..td class="namaForm">Form A-3<../td>
<..td class="content" onMouseOver="this.bgColor='#EAEAEA'" onMouseOut="this.bgColor='#FFFFFF'">
<..a href="#" onClick="tohtm('../sipp2005/form_A3.php')">Laporan Paket Kontrak<../a>
<..td align="center" class="clickableTXT"><img src="../images/xls.gif" alt="Simpan ke file Excel dan Print" width="20" height="20" onClick="toxls('../sipp2005/form_A3.php')"><../td>
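
Since the real target only appears inside the onClick attribute (the href is just "#"), one option might be to read it out of the page source directly. This is a minimal sketch, assuming the links keep the shape shown above. extract_tohtm_targets is a hypothetical helper name, not a WWW::Mechanize method:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::Mechanize): collect the paths passed
# to tohtm(...) in onClick attributes, keyed by the link text.
sub extract_tohtm_targets {
    my ($html) = @_;
    my %target_for;
    while ($html =~ m{<a\b[^>]*\bonClick="tohtm\('([^']+)'\)"[^>]*>(.*?)</a>}gis) {
        my ($path, $text) = ($1, $2);
        $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        $target_for{$text} = $path;
    }
    return %target_for;
}

# Sample input copied from the link markup above (without the display dots).
my $sample = q{<a href="#" onClick="tohtm('../sipp2005/form_A3.php')">Laporan Paket Kontrak</a>};
my %targets = extract_tohtm_targets($sample);
print "$_ => $targets{$_}\n" for sort keys %targets;
```

This only recovers the path. Whether fetching that path returns a populated table presumably still depends on whatever tohtm() does before loading it, which is not visible in the snippet.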

Finally, here's some of my code. This comes after I've already logged in and followed a link to the page where the HTML above comes from.

$br->get('http://.../sipp.php?thn=2008&kdprop=01'); # $br is initialized from mechanize: my $br = WWW::Mechanize->new();
                                                    # Set ->agent_alias('Windows IE 6');
my $resp = $br->content();
$resp =~ s/\x0D//g; # On a mac here, get ^M at the end of each line
my @pt_vals = get_proyektemp_values($resp); # Don't want to use mech-dump, so just regexing the newly populated values of the 3rd drop down menu
$br->form_name('sipp');
$br->field('proyektemp', $pt_vals[1]);
$br->submit();
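
get_proyektemp_values is not shown above. Here is a minimal sketch of what such a regex-based helper could look like, assuming the third menu is a <select name="proyektemp"> whose options have the same shape as the ones in the earlier snippets. The sample markup and its values are made up for illustration:

```perl
use strict;
use warnings;

# Sketch of a helper like get_proyektemp_values: return the option values
# of the <select name="proyektemp"> element found in $html.
sub get_proyektemp_values {
    my ($html) = @_;
    # Grab everything between the opening proyektemp <select> and its close
    # (or the end of the document if the closing tag is missing).
    my ($body) = $html =~ m{<select\b[^>]*\bname="proyektemp"[^>]*>(.*?)(?:</select>|\z)}is
        or return;
    # Collect the value attribute of every <option> in that block.
    return $body =~ m{<option\b[^>]*\bvalue="([^"]*)"}gi;
}

# Invented sample markup; the real option values are not shown in the post.
my $sample = <<'HTML';
<select name="proyektemp" style="width:150">
<option value="0">Pilih</option>
<option value="A1">Example A1</option>
</select>
HTML
my @pt_vals = get_proyektemp_values($sample);
print join(',', @pt_vals), "\n";
```

If the first option is a placeholder like the "Pilih ..." entries in the other two menus, index 1 is the first real value.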
Then I've tried both of the following.
# 1
my $link_resp = $br->follow_link(text_regex => qr/paket\s+kontrak/i);
# 2
$br->get('.../sipp2005/form_A3.php');

#1 just brings me back to the same page I started on. #2 gets me to the data table page, but with an (incorrectly) empty table. Am I just being a newbie web programmer idiot? Is this a JavaScript problem? Are these select controls and links all calling JavaScript functions, which Mechanize doesn't interpret? Are there other libraries that would scrape this page successfully? I've also tried the Python version of Mechanize, but had no success there either.

Thanks, Matt

In reply to Mechanize, Forms, Links, problem from Javascript? by goodepic
