PerlMonks  

How to scraper ASP websites

by Anonymous Monk
on Sep 05, 2012 at 04:59 UTC (#991737=perlquestion)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have an ASP website of my own and I want to run a daily scraper to keep track of all its data.

The issue is that the site is in ASP, and when I fetch the contents they come back in some encoded form. I tried WWW::Mechanize, but the output is the same: JavaScript and encoded data.

I tried WWW::Mechanize::Firefox, but it was not even able to log in to the website; it simply says no such fields were found.

Then I went with HTML::TreeBuilderX::ASP_NET, but I could not understand exactly how to use it.

Even in Mechanize I tried to post __EVENTVALIDATION and __VIEWSTATE, but no luck there either. Can anyone help me with this? Here is the sample WWW::Mechanize program:
#!/usr/bin/perl
use WWW::Mechanize;

my $mech = WWW::Mechanize->new(autocheck => 0, autodie => 0);
$mech->agent('wonderbot/JS 1.0');
$mech->cookie_jar(HTTP::Cookies->new());
$mech->get('https://XXXXX.aspx');

my $email = 'XXXX';
my $password = 'XXXX';
my $org_id = 'XXXXXX';

$mech->form_id('loginForm');
$mech->field('sUserName' => $email);
$mech->field('sPassword' => $password);
$mech->field('sParentUID' => $org_id);
#$mech->field('__EVENTVALIDATION' => $__EVENTVALIDATION);
#$mech->field('__VIEWSTATE' => $__VIEWSTATE);
$mech->click();

$mech->get('https://YYYY.aspx'); # going to another link on the same website
my $html_string = $mech->content();
print $html_string;
Code for WWW::Mechanize::Firefox:
#!/usr/bin/perl
use WWW::Mechanize;

my $mech = WWW::Mechanize->new(autocheck => 0, autodie => 0);
$mech->agent('wonderbot/JS 1.0');
$mech->cookie_jar(HTTP::Cookies->new());
$mech->get('https://XXXXX.aspx');

my $email = 'XXXX';
my $password = 'XXXX';
my $org_id = 'XXXXXX';

$mech->form_id('loginForm');
$mech->field('sUserName' => $email);
$mech->field('sPassword' => $password);
$mech->field('sParentUID' => $org_id);
#$mech->field('__EVENTVALIDATION' => $__EVENTVALIDATION);
#$mech->field('__VIEWSTATE' => $__VIEWSTATE);
$mech->click();

$mech->get('https://YYYY.aspx'); # going to another link on the same website
my $html_string = $mech->content();
print $html_string;
Any help with this would be greatly appreciated. Thanks.

Re: How to scraper ASP websites
by Corion (Pope) on Sep 05, 2012 at 05:28 UTC
    You don't show any code using WWW::Mechanize::Firefox. In theory, Firefox should find the fields. What might help is if you show the actual relevant code and (only!) the relevant HTML that declares the fields, together with the error message you get.
Re: How to scraper ASP websites
by Gangabass (Priest) on Sep 05, 2012 at 15:04 UTC

    Always use strict; and use warnings;. You never define $__EVENTVALIDATION and $__VIEWSTATE, which is why your code sends empty values to the target site.

    These values are in the HTML form you receive from the target site, and WWW::Mechanize will send them back automatically for you.

    So usually you just need to provide the login and password (and ParentUID in your example) to get through the auth page. Maybe you need to pass the button name to click()... Or maybe this is some kind of JavaScript magic (try logging in with your browser with JavaScript disabled)...

    Also, I'm sure WWW::Mechanize::Firefox will work for such a site (of course, you must not touch __VIEWSTATE and __EVENTVALIDATION).
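
To illustrate the point about hidden fields being sent automatically, here is a minimal offline sketch using HTML::Form, the module WWW::Mechanize relies on for form handling. The HTML snippet, the hidden field values, the button name btnLogin, and the example.invalid URL are all made up for illustration; only the field names sUserName and sPassword come from the question. It is not the real login page, just a demonstration of the mechanism.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical login page; the real site's markup will differ.
my $html = <<'HTML';
<form id="loginForm" method="post" action="/login.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5">
  <input type="hidden" name="__EVENTVALIDATION" value="AbCdEf123">
  <input type="text" name="sUserName">
  <input type="password" name="sPassword">
  <input type="submit" name="btnLogin" value="Log in">
</form>
HTML

# Parse the page; the second argument is the base URI for relative actions.
my ($form) = HTML::Form->parse($html, 'https://example.invalid/');

# Fill in only the visible fields; the hidden ones are left untouched.
$form->value(sUserName => 'some_user');
$form->value(sPassword => 'secret');

# click() builds the HTTP::Request that a browser would send.
my $request = $form->click('btnLogin');

print $request->method, ' ', $request->uri, "\n";
print $request->content, "\n";
```

The printed request body contains __VIEWSTATE and __EVENTVALIDATION with the values from the page, even though the script never set them. Setting them by hand from undefined variables, as in the commented-out lines of the question, would only overwrite good values with empty ones.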

      #!/usr/bin/perl
      use WWW::Mechanize;

      my $mech = WWW::Mechanize->new();
      my $url = ('http://www.folkeferie.dk/da/ferier/Aktuelle-chartertilbud---afbudsrejser/');
      $mech->get($url);
      my $hsh={};
      $links = $mech->find_all_links(url_regex=>qr/templates\/textPage\.aspx\?id/i, text_regex=>qr/Afbudsrejser/i);
      foreach my $link (@$links) {
          $url = $link->url_abs();
          $mech->get($url);
          my $content = $mech->content();
          while ($content=~/tr class="bgrow1"><td>(.*?)<\/td><td class="countryValue">(.*?)<\/td><td class="destnameValue">(.*?)<\/td><td class="hotelNameValue">(.*?)<\/td><td class="durationValue">(.*?)<\/td><td align="RIGHT" class="priceValue"><a target="_blank" href="(.*?)">(.*?)<\/a><\/td>/gisxm) {
              $hsh->{'url'} = $6;
              $hsh->{'crap_id'} = '';
              $hsh->{'date'} = $1;
              $hsh->{'country'} = $2;
              $hsh->{'destination'} = $3;
              $hsh->{'trip_type'} = $4;
              $hsh->{'trip_length'} = $5;
              $hsh->{'price'}=$7;
              print "$hsh->{'date'}, $hsh->{'country'}, $hsh->{'destination'}, $hsh->{'trip_type'}, $hsh->{'trip_length'}, $hsh->{'price'}, $hsh->{'crap_id'}, $hsh->{'url'}, $airport\n\n";
          }
      }
      Please have a look at this code, check the link, and tell me how I can scrape the details from there. Regards

        This is less likely to get help than the node you messily copied and pasted it from.

        My recommendation is to think up an actual programming question relating to the code you are presenting. Something along the lines of:

        I'm trying to scrape a website. The following minimal code snippet is failing to produce the output I was expecting. I was expecting xyz, but instead I'm getting abc, plus an explosion of shards of solidified lava. I think the problem is with the pdq statement, but when I tried lmnop I got hot molten lava instead. How should I rewrite the thingamagizzer so that it would produce xyz rather than abc and hot lava?

        (Fill in the variables and problem description as necessary to reflect the current situation)


        Dave

        Please check the code:
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new();
        my @urls = ('http://www.folkeferie.dk/da/ferier/Aktuelle-chartertilbud---afbudsrejser/');
        foreach my $url (@urls) {
            $mech->get($url);
            my $hsh={};
            $links = $mech->find_all_links(url_regex=>qr/templates\/textPage\.aspx\?id/i, text_regex=>qr/Afbudsrejser/i);
            foreach my $link (@$links) {
                $url = $link->url_abs();
                print "\n\n\n".$url."\n\n";
                $mech->get($url);
                my $content = $mech->content();
                print $content;
                while ($content=~/tr class="bgrow1"><td>(.*?)<\/td><td class="countryValue">(.*?)<\/td><td class="destnameValue">(.*?)<\/td><td class="hotelNameValue">(.*?)<\/td><td class="durationValue">(.*?)<\/td><td align="RIGHT" class="priceValue"><a target="_blank" href="(.*?)">(.*?)<\/a><\/td>/gisxm) {
                    $hsh->{'url'} = $6;
                    $hsh->{'crap_id'} = '';
                    $hsh->{'date'} = $1;
                    $hsh->{'country'} = $2;
                    $hsh->{'destination'} = $3;
                    $hsh->{'trip_type'} = $4;
                    $hsh->{'trip_length'} = $5;
                    $hsh->{'price'}=$7;
                    print "$hsh->{'date'}, $hsh->{'country'}, $hsh->{'destination'}, $hsh->{'trip_type'}, $hsh->{'trip_length'}, $hsh->{'price'}, $hsh->{'crap_id'}, $hsh->{'url'}, $airport\n\n";
                }
            }
        }
        The site is developed in ASP, so the page source is not clean HTML. That's why I am having so many problems fetching data from this site.
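
One alternative to matching the whole row with a single long regex is to let an HTML parser find the cells by class name. The sketch below is not the poster's method; it swaps in HTML::TreeBuilder and runs against a made-up sample row. The class names (bgrow1, countryValue, destnameValue, hotelNameValue, durationValue, priceValue) are taken from the regex above, while the cell contents and the href are invented, and the real page may be laid out differently.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample row; the values are invented for illustration.
my $html = <<'HTML';
<table>
<tr class="bgrow1"><td>01-09-2012</td><td class="countryValue">Spanien</td><td class="destnameValue">Mallorca</td><td class="hotelNameValue">Hotel X</td><td class="durationValue">7</td><td align="RIGHT" class="priceValue"><a target="_blank" href="/book?id=1">2.995</a></td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @trips;
for my $row ($tree->look_down(_tag => 'tr', class => 'bgrow1')) {
    my %trip;

    # Cells that carry a class are looked up by that class.
    for my $field (qw(country destname hotelName duration price)) {
        my $cell = $row->look_down(_tag => 'td', class => "${field}Value")
            or next;
        $trip{$field} = $cell->as_text;
    }

    # The date cell has no class, so take the first <td> in the row.
    my ($first_cell) = $row->look_down(_tag => 'td');
    $trip{date} = $first_cell->as_text if $first_cell;

    # The booking link sits inside the price cell.
    my $link = $row->look_down(_tag => 'a');
    $trip{url} = $link->attr('href') if $link;

    push @trips, \%trip;
}
$tree->delete;    # free the parse tree

for my $trip (@trips) {
    print join(', ', map { "$_=$trip->{$_}" } sort keys %$trip), "\n";
}
```

With WWW::Mechanize, the string passed to new_from_content would be $mech->content() instead of the sample. A parser tolerates attribute order and whitespace changes that would break a literal regex. Note also that the /x flag makes Perl ignore unescaped whitespace inside the pattern, so the spaces written in the regex above do not match the spaces in the HTML.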
