http://www.perlmonks.org?node_id=695886

vit has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I can't figure out why it works for URLs #1 and #3 but does not work for #2; it just returns an empty string.
#!/usr/bin/perl
use strict;
use warnings;

my $url;
#1
$url = "http://www.perlmonks.org";
#2
$url = "http://en.wikipedia.org/wiki/Hotel";
#3
$url = "http://search.yahoo.com/search?p=hotel&fr=yfp-t-103&toggle=1&cop=mss&ei=UTF-8";

use LWP::Simple;
my $str = LWP::Simple::get($url);
#----------------------
#print "$str\n";
#----------------------

Replies are listed 'Best First'.
Re: LWP::Simple::get($url) does not work for some urls
by Gangabass (Vicar) on Jul 07, 2008 at 00:00 UTC

    I think this is the problem (in the server's response to your request):

    As you can see, it uses a gzipped response, so you must have something that unzips it for you.

      As you can see, it uses a gzipped response, so you must have something that unzips it for you.
      Usually the server sends its content gzipped only if the client says it can handle it, so if you don't send an Accept-Encoding: gzip header, you will get the plain content.
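      A minimal sketch of that idea (mine, not from the thread), assuming a libwww-perl recent enough that HTTP::Response provides decoded_content: advertise gzip support explicitly and let decoded_content undo the compression for you.

      use strict;
      use warnings;
      use LWP::UserAgent;

      my $ua  = LWP::UserAgent->new;
      my $url = "http://en.wikipedia.org/wiki/Hotel";

      # Tell the server we can handle a compressed response ...
      my $res = $ua->get( $url, 'Accept-Encoding' => 'gzip' );

      # ... and let HTTP::Response undo the Content-Encoding for us.
      print $res->decoded_content if $res->is_success;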
Re: LWP::Simple::get($url) does not work for some urls
by Your Mother (Archbishop) on Jul 06, 2008 at 23:53 UTC

    Update: I think I was totally off and Gangabass is correct below.

    Looks like LWP is on their block list; curl appears to be too. If you really need to fetch something automatically, you can set a user agent that isn't blocked. This is just one way:

    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new( agent => "NotBlocked/0.01" );
    my $url  = "http://en.wikipedia.org/wiki/Hotel";
    $mech->get($url);
    print $mech->content;

    But they appear to be blocking automatic access intentionally, so I suspect it's a TOS violation (this is just a guess). They distribute the DB (I have no idea how, but I'm sure they have good docs on it), so you can surely get direct loads of what you're after while respecting their policies.

Re: LWP::Simple::get($url) does not work for some urls
by Lawliet (Curate) on Jul 06, 2008 at 23:55 UTC
    Hmm. Did you try the same using LWP::UserAgent? I think that may fix the problem.

    Is there a specific reason you need to use Simple?

    Update: After a quick test,

    use LWP::UserAgent;
    use HTTP::Request;

    my $ua  = LWP::UserAgent->new();
    my $req = HTTP::Request->new( GET => 'http://en.wikipedia.org/wiki/Hotel' );
    my $res = $ua->request($req);
    my $content = $res->content;
    #----------------------
    print "$content\n";
    #----------------------

    Seems to work fine.
Re: LWP::Simple::get($url) does not work for some urls
by Khen1950fx (Canon) on Jul 07, 2008 at 07:04 UTC
    I think that LWP::Simple isn't picking up on XHTML. What's happening is that Wikipedia is using a wiki markup dialect called MediaWiki, and the best way that I have found to get that URL is to use HTML::WikiConverter. Try this:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::WikiConverter;

    my $wc = HTML::WikiConverter->new( dialect => 'MediaWiki' );
    print $wc->html2wiki( uri => 'http://en.wikipedia.org/wiki/Hotel' ), "\n";
    Updated: fixed typo
      Maybe I misunderstood your reply, but I'm quite sure that LWP::Simple is ignorant of the content type and just returns whatever the server sends. It doesn't complain when the content type is XHTML rather than HTML.
        I agree. The problem isn't with LWP::Simple, but with Wikipedia. Sorry if there was a misunderstanding.
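      A small sketch illustrating that point (my own, not from the thread): LWP::Simple's head() reports whatever Content-Type the server declares, and get() hands back the body regardless. I use perlmonks.org here only because Wikipedia rejects the default agent.

      use strict;
      use warnings;
      use LWP::Simple qw(head get);

      my $url = "http://www.perlmonks.org";

      # In list context head() returns content type, length, mtime, expires, server.
      my ($type) = head($url);
      print "Content-Type: $type\n" if defined $type;

      # get() returns the body no matter what that type is (html, xhtml, ...).
      my $body = get($url);
      print length($body), " bytes fetched\n" if defined $body;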
Re: LWP::Simple::get($url) does not work for some urls
by Gangabass (Vicar) on Jul 08, 2008 at 00:46 UTC

    Sorry, I was totally wrong. It's a User-Agent header problem. For some reason Wikipedia doesn't like LWP::Simple's default header. But you can change it! (Thanks to Dog and Pony for Getting more out of LWP::Simple.)

    use strict;
    use warnings;
    use LWP::Simple qw($ua get);

    $ua->agent('My agent/1.0');

    my $url  = "http://en.wikipedia.org/wiki/Hotel";
    # Low-precedence 'or' fires the die only when get() itself fails.
    my $html = get($url) or die "Timed out!";
    print $html;
Re: LWP::Simple::get($url) does not work for some urls
by linuxer (Curate) on Jul 07, 2008 at 20:03 UTC

    Just did a test:

    $ HEAD http://en.wikipedia.org/wiki/Hotel | head -1
    403 Forbidden
    $ HEAD http://www.perlmonks.org | head -1
    200 OK
    $ HEAD "http://search.yahoo.com/search?p=hotel&fr=yfp-t-103&toggle=1&cop=mss&ei=UTF-8?" | head -1
    200 OK
    $
    HEAD is a Perl script provided by libwww-perl.

    While wget (perl independent package) results in:

    $ wget -S -O /dev/null http://en.wikipedia.org/wiki/Hotel 2>&1 | head -5 | tail -1
      HTTP/1.0 200 OK

    So it looks like they don't like Perl (or whatever headers HEAD sends) at Wikipedia...
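
    One way to check that it really is the User-Agent string (a sketch under that assumption; 'MyTester/1.0' is just a made-up agent name): send the same HEAD request with the default libwww-perl agent and with a custom one, and compare the status lines.

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $url = "http://en.wikipedia.org/wiki/Hotel";

    for my $agent ( undef, 'MyTester/1.0' ) {
        my $ua = LWP::UserAgent->new;
        $ua->agent($agent) if defined $agent;   # undef keeps the default libwww-perl agent

        my $res = $ua->head($url);
        printf "%-25s %s\n", $ua->agent, $res->status_line;
    }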