PerlMonks  

Hi monks,

I'm trying to scrape a large website. There are two single-select drop-down lists that refresh the page and populate a third single-select drop-down list. After selecting from this third list, you click on one of 8 links below it. In the page HTML, each link's href is "#" and its onClick attribute is tohtm('../.*.php'), where .* stands for a different path per link. In Firefox this opens a new page/tab and brings you to a data table whose contents I need.

I'm using WWW::Mechanize for this. I can log in through the first page of the site and follow a link to get to the page described above. I then tried setting the first two selects (after selecting the form they're inside by name), but that doesn't seem to work: the responses I get still have an unpopulated third drop-down.

Luckily, changing those first two selects brings you to a different URL. So I've also tried just calling $br->get() on that URL and then selecting from the third drop-down and submitting/clicking. From there I've tried following the link to the data with follow_link(), but that just brings me back to the same page, with a "#" tacked onto the URL. I've also tried getting the URL for the data page directly after selecting from the third drop-down, but that gives me a page with an empty data table, whereas the table is populated when the page is reached through the browser.

Below are some snippets from the HTML of the page I'm working with and the key lines from the code I'm trying to get to work.

<form name="sipp" method="post" target="_blank">
<input name="ses_id" type="hidden" value="sid">
...
<select name="fiscal" size="1" style="width:150" onChange="enableIt(this,document.sipp.propinsi); gatherInfothn(this, 'thn='); getval(this,document.sipp.thnang);">
<option value="0">Pilih Tahun</option>
<option value="2008" selected>2008</option>
<option value="2007">2007</option>
<option value="2006">2006</option>
<option value="2005">2005</option>
<option value="2004">2004</option>
</select>
<input type="hidden" name="thnang" value="">
...
<select name="propinsi" style="width:150" onChange="enableIt(this,document.sipp.proyektemp); gatherInfoProp(document.sipp.fiscal, this, 'thn=','&kdprop='); getName(this,document.sipp.nmpropinsi);">
<option value="0">Pilih Propinsi</option>
<option value="01" selected>DKI Jakarta </option>
<option value="02">Jawa Barat </option>
<option value="03">Jawa Tengah </option>

And the links to the data table I need look like this:

<tr>
<td class="namaForm">Form A-3</td>
<td class="content" onMouseOver="this.bgColor='#EAEAEA'" onMouseOut="this.bgColor='#FFFFFF'">
<a href="#" onClick="tohtm('../sipp2005/form_A3.php')">Laporan Paket Kontrak</a>
</td>
<td align="center" class="clickableTXT"><img src="../images/xls.gif" alt="Simpan ke file Excel dan Print" width="20" height="20" onClick="toxls('../sipp2005/form_A3.php')"></td>
</tr>
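
Since the href is only "#", the real target has to come out of the onClick attribute. A minimal sketch of pulling it out with a regex, using only core Perl; the helper name tohtm_targets is made up for illustration, and the sample markup is trimmed from the row shown above:

```perl
use strict;
use warnings;

# Sample markup trimmed from the row above; the real page has eight such rows.
my $html = <<'HTML';
<td class="namaForm">Form A-3</td>
<a href="#" onClick="tohtm('../sipp2005/form_A3.php')">Laporan Paket Kontrak</a>
<img src="../images/xls.gif" onClick="toxls('../sipp2005/form_A3.php')">
HTML

# Hypothetical helper: return the path argument of every tohtm('...') call
# that appears in an onClick attribute, in document order.
sub tohtm_targets {
    my ($page) = @_;
    return $page =~ /onClick\s*=\s*"tohtm\('([^']+)'\)"/gi;
}

my @targets = tohtm_targets($html);
print "$_\n" for @targets;   # prints ../sipp2005/form_A3.php
```

The extracted path is relative; $br->get() resolves relative URLs against the current page, so it can be passed in as-is. This only recovers the address, though. It does not reproduce whatever else tohtm() does before the new page opens.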

Finally, here's some of my code. This comes after I've already logged in and followed a link to the page where the HTML above comes from.

# $br is initialized from WWW::Mechanize:
#   my $br = WWW::Mechanize->new();
#   with ->agent_alias('Windows IE 6') set
$br->get('http://.../sipp.php?thn=2008&kdprop=01');
my $resp = $br->content();
$resp =~ s/\x0D//g;   # On a Mac here, get ^M at the end of each line

# Don't want to use mech-dump, so just regexing the newly populated
# values of the 3rd drop-down menu
my @pt_vals = get_proyektemp_values($resp);

$br->form_name('sipp');
$br->field('proyektemp', "$pt_vals[1]");
$br->submit();
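
The get_proyektemp_values() helper is not shown above. A hypothetical stand-in, assuming the third drop-down is rendered as a plain <select name="proyektemp"> with quoted value attributes, could look like the sketch below. The sample markup and its option values are invented for illustration only.

```perl
use strict;
use warnings;

# Hypothetical stand-in for get_proyektemp_values(): return the value of every
# <option> inside <select name="proyektemp">, in document order.
sub get_proyektemp_values {
    my ($html) = @_;
    # Grab the body of the proyektemp select; bail out with an empty list if absent.
    my ($select) = $html =~ m{<select\b[^>]*\bname="proyektemp"[^>]*>(.*?)</select>}is
        or return;
    return $select =~ /<option\b[^>]*\bvalue="([^"]*)"/gi;
}

# Made-up markup; the real option values are not shown in this post.
my $sample = <<'HTML';
<select name="proyektemp">
<option value="0">Placeholder entry</option>
<option value="A1">Example project</option>
</select>
HTML

my @pt_vals = get_proyektemp_values($sample);
print "$pt_vals[1]\n";   # prints A1
```

If the real list starts with a "0" placeholder option like the other two selects do, index 1 is the first real choice, which matches the use of $pt_vals[1] above.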

Then I've tried both of the following.

# 1
my $link_resp = $br->follow_link(text_regex => qr/paket\s+kontrak/i);

# 2
$br->get('.../sipp2005/form_A3.php');

#1 just brings me back to the same page I started on. #2 gets me to the data table page, but with an (incorrectly) empty table. Am I just being a newbie web programmer idiot? Is this a JavaScript problem? Are these select controls and links all calling JavaScript functions, which Mechanize doesn't interpret? Are there other libraries that would scrape this page successfully? I've also tried the Python version of Mechanize, but had no success there either.
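
One thing that might be worth testing is doing by hand what the JavaScript presumably does. The sketch below rests on two unconfirmed assumptions: that getval() copies the selected year into the hidden thnang field, and that tohtm() points the sipp form at the given path and submits it. The host example.com and the /app/ directory are placeholders, since the real URL is elided above. The request is only built, not sent.

```perl
use strict;
use warnings;
use HTML::Form;
use URI;

# Placeholder base URL; the real host and directory are elided in this post.
my $base = 'http://example.com/app/sipp.php';

# Trimmed copy of the form from the page.
my $html = <<'HTML';
<form name="sipp" method="post" target="_blank">
<input name="ses_id" type="hidden" value="sid">
<select name="fiscal" size="1">
<option value="0">Pilih Tahun</option>
<option value="2008" selected>2008</option>
</select>
<input type="hidden" name="thnang" value="">
</form>
HTML

# In scalar context parse() returns the first form in the document.
my $form = HTML::Form->parse($html, $base);

# Assumption: getval() copies the selected year into the hidden field.
$form->value('thnang', $form->value('fiscal'));

# Assumption: tohtm() points the form at the given path and submits it.
$form->action(URI->new_abs('../sipp2005/form_A3.php', $base));

my $req = $form->make_request;   # an HTTP::Request, built but not sent
print $req->method, ' ', $req->uri, "\n";
print $req->content, "\n";
```

With Mechanize, $br->current_form returns the same kind of HTML::Form object, so the same calls would apply there, and $br->request($req) would send the result. If the server keeps the selections in the session rather than in the posted fields, this alone may still return an empty table; reading the site's JavaScript source for tohtm() would settle what it really does.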

Thanks, Matt

In reply to Mechanize, Forms, Links, problem from Javascript? by goodepic
