http://www.perlmonks.org?node_id=695886

vit has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I can't figure out why it works for URLs #1 and #3 but does not work for #2; it just returns an empty string.
#!/usr/bin/perl
use strict;
use warnings;

my $url;
#1
$url = "http://www.perlmonks.org";
#2
$url = "http://en.wikipedia.org/wiki/Hotel";
#3
$url = "http://search.yahoo.com/search?p=hotel&fr=yfp-t-103&toggle=1&cop=mss&ei=UTF-8";

use LWP::Simple;
my $str = LWP::Simple::get($url);
#----------------------
#print "$str\n";
#----------------------

Replies are listed 'Best First'.
Re: LWP::Simple::get($url) does not work for some urls
by Gangabass (Vicar) on Jul 07, 2008 at 00:00 UTC

    I think this is the problem (in the server's response to your request):

    As you can see, it uses a gzipped response, so you must have something that unzips it for you.

      As you can see, it uses a gzipped response, so you must have something that unzips it for you.
      Usually the server sends its content gzipped only if the client says it can handle it, so if you don't send an Accept-Encoding: gzip header, you will get the plain content.
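      A minimal sketch of that idea (mine, not from the thread), assuming a libwww-perl recent enough that HTTP::Response provides decoded_content: advertise gzip support explicitly and let decoded_content undo the compression for you.

      use strict;
      use warnings;
      use LWP::UserAgent;

      my $ua  = LWP::UserAgent->new;
      my $url = "http://en.wikipedia.org/wiki/Hotel";

      # Tell the server we can handle a compressed response ...
      my $res = $ua->get( $url, 'Accept-Encoding' => 'gzip' );

      # ... and let HTTP::Response undo the Content-Encoding for us.
      print $res->decoded_content if $res->is_success;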
Re: LWP::Simple::get($url) does not work for some urls
by Your Mother (Archbishop) on Jul 06, 2008 at 23:53 UTC

    Update: I think I was totally off and Gangabass is correct below.

    Looks like LWP is on their block list; curl appears to be too. If you really need to fetch something automatically, you can set a user agent that isn't blocked. This is just one way:

    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new( agent => "NotBlocked/0.01" );
    my $url  = "http://en.wikipedia.org/wiki/Hotel";
    $mech->get($url);
    print $mech->content;

    But they appear to be blocking automatic access intentionally, so I suspect it's a TOS violation (this is just a guess). They distribute the DB (I have no idea how, but I'm sure they have good docs on it), so you can surely get direct loads of what you're after while respecting their policies.

Re: LWP::Simple::get($url) does not work for some urls
by Lawliet (Curate) on Jul 06, 2008 at 23:55 UTC
    Hmm. Did you try the same using LWP::UserAgent? I think that may fix the problem.

    Is there a specific reason you need to use Simple?

    Update: After a quick test,

    use LWP::UserAgent;
    use HTTP::Request;

    my $ua  = LWP::UserAgent->new();
    my $req = HTTP::Request->new( GET => 'http://en.wikipedia.org/wiki/Hotel' );
    my $res = $ua->request($req);
    my $content = $res->content;
    #----------------------
    print "$content\n";
    #----------------------

    Seems to work fine.
Re: LWP::Simple::get($url) does not work for some urls
by Khen1950fx (Canon) on Jul 07, 2008 at 07:04 UTC
    I think that LWP::Simple isn't picking up on XHTML. What's happening is that Wikipedia is using a wiki markup dialect called MediaWiki, and the best way that I have found to get that URL is to use HTML::WikiConverter. Try this:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::WikiConverter;

    my $wc = HTML::WikiConverter->new( dialect => 'MediaWiki' );
    print $wc->html2wiki( uri => 'http://en.wikipedia.org/wiki/Hotel' ), "\n";
    Updated: fixed typo
      Maybe I misunderstood your reply, but I'm quite sure that LWP::Simple is ignorant of the content type and just returns whatever the server sends. It doesn't complain when the content type is XHTML rather than HTML.
        I agree. The problem isn't with LWP::Simple, but with Wikipedia. Sorry if there was a misunderstanding.
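      A small sketch illustrating that point (my own, not from the thread): LWP::Simple's head() reports whatever Content-Type the server declares, and get() hands back the body regardless. I use perlmonks.org here only because Wikipedia rejects the default agent.

      use strict;
      use warnings;
      use LWP::Simple qw(head get);

      my $url = "http://www.perlmonks.org";

      # In list context head() returns content type, length, mtime, expires, server.
      my ($type) = head($url);
      print "Content-Type: $type\n" if defined $type;

      # get() returns the body no matter what that type is (html, xhtml, ...).
      my $body = get($url);
      print length($body), " bytes fetched\n" if defined $body;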
Re: LWP::Simple::get($url) does not work for some urls
by Gangabass (Vicar) on Jul 08, 2008 at 00:46 UTC

    Sorry, I was totally wrong. It's a User-Agent header problem. For some reason Wikipedia doesn't like LWP::Simple's default header. But you can change it! (Thanks to Dog and Pony for Getting more out of LWP::Simple.)

    use strict;
    use warnings;
    use LWP::Simple qw($ua get);

    $ua->agent('My agent/1.0');

    my $url  = "http://en.wikipedia.org/wiki/Hotel";
    # Low-precedence 'or' fires the die only when get() itself fails.
    my $html = get($url) or die "Timed out!";
    print $html;
Re: LWP::Simple::get($url) does not work for some urls
by linuxer (Curate) on Jul 07, 2008 at 20:03 UTC

    Just did a test:

    $ HEAD http://en.wikipedia.org/wiki/Hotel | head -1
    403 Forbidden
    $ HEAD http://www.perlmonks.org | head -1
    200 OK
    $ HEAD "http://search.yahoo.com/search?p=hotel&fr=yfp-t-103&toggle=1&cop=mss&ei=UTF-8?" | head -1
    200 OK
    $
    HEAD is a Perl script provided by libwww-perl.

    While wget (perl independent package) results in:

    $ wget -S -O /dev/null http://en.wikipedia.org/wiki/Hotel 2>&1 | head -5 | tail -1
      HTTP/1.0 200 OK

    So it looks like they don't like Perl (or whatever headers HEAD sends) at Wikipedia...
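
    One way to check that it really is the User-Agent string (a sketch under that assumption; 'MyTester/1.0' is just a made-up agent name): send the same HEAD request with the default libwww-perl agent and with a custom one, and compare the status lines.

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $url = "http://en.wikipedia.org/wiki/Hotel";

    for my $agent ( undef, 'MyTester/1.0' ) {
        my $ua = LWP::UserAgent->new;
        $ua->agent($agent) if defined $agent;   # undef keeps the default libwww-perl agent

        my $res = $ua->head($url);
        printf "%-25s %s\n", $ua->agent, $res->status_line;
    }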