Test::WWW::Mechanize page_links_ok fails on wikipedia entry external links

by mandog (Curate)
on Feb 05, 2009 at 14:01 UTC
mandog has asked for the wisdom of the Perl Monks concerning the following question:

I'm enjoying Test::WWW::Mechanize, with one quirk: it fails on links to Wikipedia entry pages.

A link to the main Wikipedia page is OK.

There is no javascript in the page.

The link works in Firefox, Konqueror, and wget. Per the WWW::Mechanize FAQ, I've struggled to find the difference between what the browsers send and what Mech sends.

The only slight clue is that wget -vS shows that Wikipedia is behind a Squid caching proxy, but this is also the case for the front page, which works.
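For comparison, here is a rough diagnostic sketch (not part of the original test script; the URL is one of the failing links below) that dumps the request and response headers Mech actually exchanges, so they can be diffed against the wget -vS output:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# autocheck => 0 so a 4xx/5xx response doesn't die before we can inspect it
my $mech = WWW::Mechanize->new( autocheck => 0 );
$mech->get('http://en.wikipedia.org/wiki/Affero_General_Public_License');

print "--- request ---\n",  $mech->response->request->as_string;
print "--- response ---\n", $mech->response->headers_as_string, "\n";
print "status: ", $mech->status, "\n";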

To get on with my life, I'm linking to gnu.org instead, but I would still appreciate any help in resolving this mystery.

Below is a minimal but complete Perl script and HTML page to reproduce the problem:

#!/usr/bin/perl
use strict;
use warnings;
use Test::WWW::Mechanize;
use Test::More tests => 2;

my $mech = Test::WWW::Mechanize->new(
    stack_depth => 10,
    timeout     => 60,
);

$mech->get_ok('http://localhost/test.html');

# fails
$mech->page_links_ok();

__DATA__
<!-- http://localhost/test.html -->
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <title>test</title>
  </head>
  <body>
    <p>
      <a href="http://en.wikipedia.org/wiki/Affero_General_Public_License">fails</a>
    </p>
    <p>
      <a href="http://thecsl.org">works</a>
    </p>
    <p>
      <a href="http://wikipedia.org/">works</a>
    </p>
    <p>
      <a href="http://en.wikipedia.org/wiki/Mode_Gakuen_Cocoon_Tower">fails</a>
    </p>
    <p>
      <a href="http://www.gnu.org/licenses/agpl.html">?</a>
    </p>
  </body>
</html>

Re: Test::WWW::Mechanize page_links_ok fails on wikipedia entry external links
by Corion (Pope) on Feb 05, 2009 at 14:06 UTC

    As a hint when testing things with LWP::UserAgent or WWW::Mechanize: both accept file:// URLs as well, so you don't need a webserver to serve simple HTML pages.
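    For example, a minimal sketch of that approach (the local file path is hypothetical; point it at wherever test.html actually lives):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Test::WWW::Mechanize;
        use Test::More tests => 2;

        my $mech = Test::WWW::Mechanize->new();
        # No webserver needed: fetch the test page straight from disk
        $mech->get_ok('file:///home/mandog/test.html');
        # The links in the page are absolute http:// URLs, so this still tests them
        $mech->page_links_ok();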

Re: Test::WWW::Mechanize page_links_ok fails on wikipedia entry external links
by planetscape (Canon) on Feb 05, 2009 at 19:15 UTC

    You might experiment with different user agent settings; and/or view Wikipedia's robots.txt for clues.
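    For instance, a quick sketch of one way to act on that, using WWW::RobotRules and LWP::Simple (both ship with the LWP bundle); the agent string shown is only illustrative:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::Simple qw(get);
        use WWW::RobotRules;

        my $robots_url = 'http://en.wikipedia.org/robots.txt';
        my $page_url   = 'http://en.wikipedia.org/wiki/Affero_General_Public_License';

        # Interpret the rules as if we were Mech's default-looking agent
        my $rules = WWW::RobotRules->new('WWW-Mechanize/1.0');
        my $txt   = get($robots_url);
        $rules->parse($robots_url, $txt) if defined $txt;

        print $rules->allowed($page_url)
            ? "allowed for this agent\n"
            : "excluded by robots.txt for this agent\n";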

    HTH,

    planetscape

      Yep, robots.txt / user-agent exclusion is the problem.

      $mech->agent_alias('Windows IE 6'); works with Wikipedia but, for some reason, not with gnu.org. $mech->agent_alias('Linux Mozilla'); works for both, as in the sketch below.
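      For reference, the complete fix as a sketch (the same test script as above, with just the agent_alias line added):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Test::WWW::Mechanize;
        use Test::More tests => 2;

        my $mech = Test::WWW::Mechanize->new();
        # Identify as a desktop browser so the user-agent exclusion doesn't kick in
        $mech->agent_alias('Linux Mozilla');
        $mech->get_ok('http://localhost/test.html');
        $mech->page_links_ok();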

      I guess if Wikipedia doesn't want Mech scraping it, I won't do it.

      Thanks for your help, planetscape.
