
Perl mechanize get Error!

by Anonymous Monk
on Dec 02, 2013 at 17:19 UTC (#1065308, perlquestion)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

Greetings Monks,

Below is my code; I don't know why it is not working.

use strict;
use WWW::Mechanize;

my $url  = "http://www.truro-penwith.ac.uk/";
my $mech = WWW::Mechanize->new();
print "\nURL: $url ...\n";
eval {
    $mech->agent_alias('Windows Mozilla');
    #$mech->add_header('User-Agent'=>'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0');
    #$mech->add_header('Accept'=>'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8');
    #$mech->add_header('Accept-Language'=>'en-US,en;q=0.5');
    #$mech->add_header('Accept-Encoding'=>'gzip, deflate');
    #$mech->add_header('Cookie'=>'bb2_screener_=1385998863+111.92.64.106; PHPSESSID=078fc31740655a3a3f5fb280dbdf335d');
    $mech->add_header('Connection'=>'keep-alive');
    $mech->get($url);
};
#$mech = $mech->content();
$mech = $mech->response->content();
print $mech;
exit;

Does anyone know what the actual reason could be?

The site seems to detect this as a script. I tried adding headers with add_header and default_header, but nothing works. The response is a 400 error, and sometimes a 403 error. I wonder why this happens even though I supplied the headers. Any ideas? I have none :(

Thanks in advance
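
When a request comes back with 400 or 403, it can help to look at exactly what the server returned before guessing at headers. The following is a minimal sketch, not part of the original post: it assumes the URL is passed on the command line, and describe_response is a hypothetical helper name. It turns off autocheck so that get() does not die on an error status, then prints the status line and the response headers.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical helper: summarise an HTTP::Response as its status line
# followed by the headers the server sent back.
sub describe_response {
    my ($response) = @_;
    return join "\n",
        'Status: ' . $response->status_line,
        'Headers:',
        $response->headers->as_string;
}

# Only touch the network when a URL is given on the command line.
if ( my $url = shift @ARGV ) {
    # autocheck => 0 stops get() from dying on 4xx/5xx, so no eval is needed.
    my $mech = WWW::Mechanize->new( autocheck => 0 );
    $mech->agent_alias('Windows Mozilla');
    $mech->get($url);

    print describe_response( $mech->response ), "\n";
    print $mech->success ? "Request succeeded\n" : "Request failed\n";
}
```

Run it as, for example, perl check.pl http://www.truro-penwith.ac.uk/ (the script name is an assumption). Wrapping get() in eval without checking $@ afterwards, as in the code above, hides the error message that Mechanize would otherwise report.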

Re: Perl mechanize get Error!
by Anonymous Monk on Dec 02, 2013 at 17:25 UTC
    They don't want you to scrape the university website. Solution, don't.
      what an idiot you are... :D???!!!
Re: Perl mechanize get Error!
by PerlSufi (Friar) on Dec 02, 2013 at 20:32 UTC
    What is your goal with this script? I have written a brief tutorial on using mechanize that can be found here: WWW::Mechanize Basics
    If you need to do a lot of navigating on the site, I would recommend WWW::Mechanize::Firefox, since the site uses a lot of JavaScript. WWW::Mechanize and JavaScript don't get along too well. Also, try
    $mech->dump_text;
    I also recommend getting the Firebug extension for Firefox and manually inspecting the page for each thing you want to access. For example, the URL for 'Latest News' is http://www.truro-penwith.ac.uk/category/news/, which I found using Firebug.
    So to go there, just do
    $mech->get('http://www.truro-penwith.ac.uk/category/news/');
    UPDATE: Also, simply:
    my $mech = WWW::Mechanize->new();
    $mech->get('http://www.truro-penwith.ac.uk/');
    $mech->dump_text;
    worked for me. You don't need to do anything with headers.
      Hi PerlSufi, you are great. OK, can you check this page, https://thebigword-careers.irecruittotal.com/cac/SearchVacancy.aspx?EmploymentTypeID=0&Intranet=0, and give us a solution? Take it as a challenge. ;) Best, Anonymous Monk
        I'm not really sure what the 'challenge' is? Do you want to be able to submit that form?
        use strict;
        use warnings;
        use WWW::Mechanize;

        # Takes the vacancy to search for as the first command-line argument.
        my $mech = WWW::Mechanize->new();
        $mech->get("https://thebigword-careers.irecruittotal.com/cac/SearchVacancy.aspx?EmploymentTypeID=0&Intranet=0");
        my $vacancy = $ARGV[0];
        # Single quotes: the field name contains '$', which a double-quoted
        # string would try to interpolate as variables.
        $mech->field('ctl00$mvMintPP$ctl00$ContentPlaceHolder_Main$mvMintPP$ctl00$txbJobRef', $vacancy);
        $mech->click_button(value => "Search Vacancies");
        $mech->dump_text;
        ..might work..
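
        Before filling in a form like this, it can be useful to list the field names the page actually exposes, since ASP.NET-style names are long and contain '$'. The following is a rough sketch under a few assumptions: the URL is passed on the command line, and list_inputs is a hypothetical helper name, not something from the posts above. It prints the type and name of every named input in every form on the page.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical helper: return "type name" strings for every named input
# in an HTML::Form object, in document order.
sub list_inputs {
    my ($form) = @_;
    return map  { $_->type . ' ' . $_->name }
           grep { defined $_->name }
           $form->inputs;
}

# Only touch the network when a URL is given on the command line.
if ( my $url = shift @ARGV ) {
    my $mech = WWW::Mechanize->new();
    $mech->get($url);

    # forms() returns one HTML::Form object per form on the current page.
    for my $form ( $mech->forms ) {
        print "$_\n" for list_inputs($form);
    }
}
```

        A name printed here can be pasted into $mech->field(...) inside single quotes, so that Perl leaves the '$' characters alone.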

Node Type: perlquestion [id://1065308]
Approved by Laurent_R