PerlMonks  

Re: mech follow_link question

by Anonymous Monk
on Feb 19, 2012 at 05:46 UTC


in reply to mech follow_link question

"it prepends the directory on my hard drive that I'm running the perl script from."

Don't think so

Re^2: mech follow_link question
by zingbust (Initiate) on Feb 19, 2012 at 15:23 UTC
    Sorry, I didn't follow the formatting rules here. I'll try again with the proper formatting for these posts.
    $m = WWW::Mechanize->new();
    $m->get($url);    # $url is some home page
    @links = $m->links();
    for $link ( @links ) {
        &follow;
    }
    sub follow {
        if ($m->follow_link( url_regex => qr/contact/i)){
            print "$link->url\n";
        }
    }
    I tried the "follow" subroutine, thinking that if the if statement was false, it would return undef. Instead, the first link it tried to follow, which did NOT contain the string "contact", crashed the program with the error message "Link not found at c:\websites\bla_bla\my_perl_script.pl". Why would mech assume the relative link was something off my hard drive instead of from the first fetched page?

      There are two things at play:

      First, "Link not found at c:\websites\bla_bla\my_perl_script.pl" is just the error message from Perl, which tells you the file and line number where the error was raised. You left off the line number, but it is likely the line in the subroutine follow() that calls follow_link.
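Where the "at ... line N" suffix comes from can be seen in plain Perl, without WWW::Mechanize at all. This is only a minimal sketch: the sub name risky_follow is made up for illustration, and it shows the general behaviour of die when the message has no trailing newline.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a call that fails; not part of WWW::Mechanize.
sub risky_follow {
    die "Link not found";    # no trailing newline, so Perl appends " at FILE line N."
}

# eval traps the exception; on failure the message is left in $@.
my $error = '';
eval { risky_follow(); 1 } or $error = $@;

# The file name in the message is the script's own path, not a link target.
print "Caught: $error";
```

The path in the message is simply the script that raised the error, which is why it points at the local hard drive.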

      The second thing is that WWW::Mechanize behaves like a browser. If you issue ->follow_link for one link, all other links you have collected will likely no longer be valid, as they belong to the previous page and not to the page you are now on. Consider dumping the ->content for each page. Maybe you want to go ->back after visiting every page in turn?
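One way to visit every matching link and still return to the start page is sketched below. It assumes WWW::Mechanize is installed, and it builds a tiny throwaway site in a temporary directory so that it runs without a network; the file names and page contents are invented for the example. It uses find_all_links to collect only the links whose URL matches, fetches each by its absolute URL, and calls back after each visit.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use URI::file;
use WWW::Mechanize;

# Hypothetical three-page site written to a temp directory (no network needed).
my $dir   = tempdir( CLEANUP => 1 );
my %pages = (
    'index.html' => '<html><body><a href="about.html">About</a> '
        . '<a href="contact.html">Contact</a> '
        . '<a href="sales-contact.html">Sales</a></body></html>',
    'contact.html'       => '<html><body>Contact page</body></html>',
    'sales-contact.html' => '<html><body>Sales contact page</body></html>',
);
for my $name ( keys %pages ) {
    open my $fh, '>', "$dir/$name" or die "Cannot write $dir/$name: $!";
    print {$fh} $pages{$name};
    close $fh;
}

my $mech = WWW::Mechanize->new();
$mech->get( URI::file->new("$dir/index.html")->as_string );

# Collect every link on the current page whose URL matches the pattern.
my @contact_links = $mech->find_all_links( url_regex => qr/contact/i );

my @visited;
for my $link (@contact_links) {
    # url_abs is already resolved against the page the link came from.
    $mech->get( $link->url_abs->as_string );
    push @visited, $mech->uri->as_string;
    $mech->back;    # return to the start page before the next link
}

print "$_\n" for @visited;
```

Here the loop variable is actually used: each iteration fetches the link it was handed, rather than asking Mechanize to search the current page again.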

      As a last point, your style of using the &follow; syntax mixed with global variables is discomforting. I would rewrite that snippet as:

      for my $link ( @links ) {
          follow( $link );
      }

      sub follow {
          my ($link) = @_;
          warn "Following contact link";
          if ($m->follow_link( url_regex => qr/contact/i)){
              print $link->url."\n";
          }
      }
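The rewrite also changes "$link->url\n" to $link->url."\n". The difference matters because Perl does not call methods inside double-quoted strings. A minimal sketch, using a made-up My::Link class as a stand-in for the real link objects:

```perl
use strict;
use warnings;

# Hypothetical minimal link class, standing in for WWW::Mechanize::Link.
package My::Link;
sub new { my ( $class, $url ) = @_; return bless { url => $url }, $class }
sub url { return $_[0]{url} }

package main;

my $link = My::Link->new('/contact.html');

# Inside double quotes only $link is interpolated; "->url" stays literal text.
my $interpolated = "$link->url";

# Calling the method outside the string returns the actual URL.
my $concatenated = $link->url . "";

print "$interpolated\n";    # the stringified object followed by "->url"
print "$concatenated\n";    # the URL itself
```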

      As for your thoughts about how WWW::Mechanize works and what its methods return, please read the WWW::Mechanize documentation. Most failures are fatal, to make it easier for you to spot when your assumptions deviate from the reality of the website you're automating.
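The fatal behaviour can be switched off. The sketch below assumes an installed WWW::Mechanize in which autocheck defaults to on, and uses an invented local page with no "contact" link so it runs without a network. It compares the default constructor with one that passes autocheck => 0.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use URI::file;
use WWW::Mechanize;

# Hypothetical page with no "contact" link, written to a temp directory.
my $dir  = tempdir( CLEANUP => 1 );
my $page = "$dir/index.html";
open my $fh, '>', $page or die "Cannot write $page: $!";
print {$fh} '<html><body><a href="about.html">About</a></body></html>';
close $fh;
my $url = URI::file->new($page)->as_string;

# autocheck on (assumed default): a link that cannot be found raises an exception.
my $strict = WWW::Mechanize->new();
$strict->get($url);
my $error;
eval { $strict->follow_link( url_regex => qr/contact/i ); 1 } or $error = $@;

# autocheck off: the same call returns undef, so it can be tested in an if.
my $lenient = WWW::Mechanize->new( autocheck => 0 );
$lenient->get($url);
my $result = $lenient->follow_link( url_regex => qr/contact/i );

print "autocheck on:  ", ( defined $error  ? "died: $error" : "no error\n" );
print "autocheck off: ", ( defined $result ? "followed\n"   : "returned undef\n" );
```

With autocheck off you have to check every return value yourself, so wrapping individual calls in eval may be the safer choice for a crawler.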

        Thank you very much for your insights. It should have been so obvious to me that my program crashed and gave the line in the script of where it crashed as part of the error message rather than mech actually trying to follow something on my hard drive. Guess I'm just tired and old. It still bugs me though that something like
        $m->follow_link( url_regex => qr/contact/i )
        completely ignores the "url_regex => qr/contact/i" part and follows EVERY link instead of just the ones that may have the word "contact" in them. I have read and continue to read WWW::Mechanize all the time, but nothing there explains why this is so and what the purpose of the regex part is, if follow_link is going to ignore that part anyway.
