PerlMonks  

Chinese site and decoded_content() trouble

by isync (Hermit)
on Jun 08, 2007 at 19:12 UTC (id://620068, perlquestion)

isync has asked for the wisdom of the Perl Monks concerning the following question:

UTF-8 and LWP seem to be a constant struggle...
Why does this little script return an empty string? Anybody?
    #!/usr/bin/perl
    use LWP::UserAgent;
    use utf8;

    my $agent = LWP::UserAgent->new();
    my $response = $agent->get('http://www.ku6.com/show/34D6sgY4X6w3YegR.html');
    print $response->decoded_content . "\n";
Using content() works but gives (at least in the console) garbled data. Doing decode("utf8", $response->content) looks like double decoding.

Some posts regarding decoded_content() on the usenet hint that decoded_content() might fail when the content isn't tagged properly with "encoding" etc. But strangely this page has a proper Content-Encoding header...

So why does it fail? And how do I get it to work?
Any help welcome!

Replies are listed 'Best First'.
Re: Chinese site and decoded_content() trouble
by graff (Chancellor) on Jun 09, 2007 at 04:06 UTC
    When I tried your script, I got this error message:
    Can't locate object method "decoded_content" via package "HTTP::Headers"
    (update: actually, I tried it with both "decode_content" and "decoded_content" -- both yielded the same sort of error)

    But when I ran it with Data::Dumper and dumped the contents of $response, I could see that it had plenty of utf8 data with lots of Chinese characters.

    I even upgraded LWP::UserAgent from 2.024 to 2.033 (the current version as of this writing), but got the same error. Did you happen to get that error as well? (It would have been worthwhile to say so.)

    If I just use the method "content" (instead of "decoded_content"), I see a lot of page content. Did you try that? Is there some reason why the output of "content" isn't what you really want?

    Another update: I forgot to comment on this:

    Using content() works but gives (at least in console) garbled data. Doing decode("utf8", $response->content) looks like doubly decoding.

    Are you sure you are using a utf8-capable console, with an appropriate unicode font that includes Chinese characters? You might try this little unicode transliterator script -- run the original data through that (without decode('utf8',...)) to see if it really is garbled. (Doesn't look garbled at all in my macosx "Terminal" window -- but I know better than to try pushing through a traditional xterm.)

      Can't locate object method "decoded_content"
      The method decoded_content is not located in LWP but in HTTP::Message, which is accessed indirectly via LWP. The method is available in version 1.57 of this module (see CPAN). Apparently it was added fairly recently; my local version 1.42 does not yet provide it.
      "Is there some reason why the output of "content" isn't what you really want?"

      --actually yes! I'd like to get clean utf8, which requires first unzipping any gzipped content and then decoding it properly from the local charset to the more universal utf8 representation (and decoded_content should do this in one simple call).
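The two steps described above (undo the compression, then decode the charset) can be sketched by hand with core modules only. This is an illustration, not what decoded_content does internally: the helper name manual_decode is hypothetical, the sample body is simulated rather than fetched, and GBK is only an assumed charset for the demo.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical helper: undo the compression first, then the charset.
sub manual_decode {
    my ($raw, $content_encoding, $charset) = @_;
    my $bytes = $raw;
    if (defined $content_encoding && $content_encoding =~ /\A(?:x-)?gzip\z/i) {
        gunzip(\$raw => \$bytes) or die "gunzip failed: $GunzipError";
    }
    # FB_CROAK makes a wrong charset die instead of silently substituting.
    return decode($charset, $bytes, Encode::FB_CROAK());
}

# Simulated response body (assumption): two Chinese characters in GBK, gzipped.
my $original = "\x{4F60}\x{597D}";
my $gbk      = encode('gbk', $original);
gzip(\$gbk => \my $wire) or die "gzip failed: $GzipError";

# Result is a Perl character string; encode it on output.
my $chars = manual_decode($wire, 'gzip', 'gbk');
binmode STDOUT, ':encoding(UTF-8)';
print $chars, "\n";
```

Note that the result is a string of Perl characters, not UTF-8 bytes; the UTF-8 encoding happens only when the string is written out.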
Re: Chinese site and decoded_content() trouble
by valdez (Monsignor) on Jun 09, 2007 at 17:54 UTC

    Ni hao :) It seems that the page you are requesting cannot be decoded properly; in fact if you add the raise_error parameter to decoded_content, you get an error. I used the following test program:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::UserAgent;
        use Encode qw/ is_utf8 /;

        my $agent = LWP::UserAgent->new();
        my @tests = (
            'http://cn.life.dada.net/people/',
            'http://www.sina.com.cn/',
            'http://www.ku6.com/show/34D6sgY4X6w3YegR.html',
            'http://www.xinhua.cn/',
        );

        foreach my $uri (@tests) {
            eval {
                printf "test: %s\n", $uri;
                my $response = $agent->get($uri);
                my $dc = $response->decoded_content( raise_error => 1 );
                printf "is decoded content utf8? %s\n", is_utf8($dc);
            };
            if ($@) {
                print "decode failed: $@\n";
            }
            print "\n";
        }

    You could try to force the charset used by adding the charset parameter.
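    A minimal sketch of forcing the charset, run offline against a hand-built HTTP::Response instead of a live fetch. The mislabelled header and the GBK body are assumptions made up for the demo; which charset a given site really needs has to be found out separately.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Response;

# Stand-in for a fetched page (assumption): the body is GBK,
# but the header claims UTF-8.
my $body = encode('gbk', "<p>\x{4F60}\x{597D}</p>");
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=utf-8' ],
    $body,
);

# The charset option overrides the charset declared in Content-Type;
# raise_error turns a failed decode into an exception.
my $text = eval {
    $response->decoded_content( charset => 'gbk', raise_error => 1 );
};
die "decode failed: $@" unless defined $text;

binmode STDOUT, ':encoding(UTF-8)';
print $text, "\n";
```

    Passing charset => 'none' instead leaves the charset alone and only undoes the Content-Encoding, which is handy when the charset is to be detected by other means.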

    Ciao, Valerio

      Thanks for your comment! And thank you for pointing out the raise_error switch - which I wasn't aware of (rtfm...)!

      Running your test script gives me these errors:
          test: http://cn.life.dada.net/people/
          is decoded content utf8? 1

          test: http://www.sina.com.cn/
          decode failed: euc-cn "\x82" does not map to Unicode at /usr/lib/perl/5.8/Encode.pm line 166.
           at test_ku6.pl line 22

          test: http://www.ku6.com/show/34D6sgY4X6w3YegR.html
          decode failed: utf8 "\xE6" does not map to Unicode at /usr/lib/perl/5.8/Encode.pm line 166.
           at test_ku6.pl line 22

          test: http://www.xinhua.cn/
          decode failed: euc-cn "\xA9" does not map to Unicode at /usr/lib/perl/5.8/Encode.pm line 166.
           at test_ku6.pl line 22

      which suggests that decoded_content uses the wrong charset for decoding (right?).

      A search on the issue led me to this helpful thread http://www.issociate.de/board/post/400895/Fetching_the_charset_when_set_in_meta..html and the non-cpan module HTTP::Response::Charset (which I haven't tried so far).

      So I am stuck here with three options:
      1. use Encode::Guess
      2. use HTML::Encoding
      3. use HTTP::Response::Charset

      A quick test seems to show that HTML::Encoding is the most reliable (and much less a hassle to install than Encode::Detect). Now I will try
          my $enco = encoding_from_http_message($resp);
          my $utf8 = decode($enco => $resp->content);
      But how do I combine it elegantly with LWP? (I am not a great module guru..)
      Is there a way to pull it into the LWP context and get, let's say, $response->as_utf8 (which is the result of the above)?
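      One possible shape for such a wrapper, as a plain subroutine rather than a method added to LWP. Everything named here is hypothetical: response_as_text, the detector callback, the fallback list, and the Demo::Response stand-in that lets the sketch run without a network. In real use the detector could be \&HTML::Encoding::encoding_from_http_message, the function named in the post above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper, not part of LWP: try the detected charset first,
# then each fallback, and return the first strict decode that succeeds.
sub response_as_text {
    my ($response, $detect, @fallbacks) = @_;
    my $bytes = $response->content;
    my $guess = $detect->($response);
    for my $enc (grep { defined } $guess, @fallbacks) {
        my $copy = $bytes;    # decode() with a CHECK value may modify its input
        my $text = eval { decode($enc, $copy, Encode::FB_CROAK()) };
        return $text if defined $text;
    }
    die "could not decode response with any candidate charset\n";
}

# Tiny stand-in class so the sketch runs offline; a real HTTP::Response
# provides the same content() method.
{
    package Demo::Response;
    sub new     { my ($class, $content) = @_; return bless { content => $content }, $class }
    sub content { return $_[0]{content} }
}

my $response = Demo::Response->new( encode('gbk', "\x{4F60}\x{597D}") );

# The detector deliberately guesses wrong here, so the fallback is used.
my $text = response_as_text($response, sub { 'UTF-8' }, 'gbk');

binmode STDOUT, ':encoding(UTF-8)';
print $text, "\n";
```

      With a real response object, $response->decoded_content(charset => 'none') in place of $response->content would also undo gzip before the charset step.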
