PerlMonks  

Re: Chinese site and decoded_content() trouble

by valdez (Monsignor)
on Jun 09, 2007 at 17:54 UTC (#620206)


in reply to Chinese site and decoded_content() trouble

Ni hao :) It seems that the page you are requesting cannot be decoded properly: if you pass the raise_error parameter to decoded_content, it dies with an error. I used the following test program:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use LWP::UserAgent;
    use Encode qw/ is_utf8 /;

    my $agent = LWP::UserAgent->new();

    my @tests = (
        'http://cn.life.dada.net/people/',
        'http://www.sina.com.cn/',
        'http://www.ku6.com/show/34D6sgY4X6w3YegR.html',
        'http://www.xinhua.cn/',
    );

    foreach my $uri (@tests) {
        eval {
            printf "test: %s\n", $uri;
            my $response = $agent->get($uri);
            # raise_error makes decoded_content die instead of returning undef
            my $dc = $response->decoded_content( raise_error => 1 );
            printf "is decoded content utf8? %s\n", is_utf8($dc);
        };
        if ($@) {
            print "decode failed: $@\n";
        }
        print "\n";
    }
You could try to force the charset by passing the charset parameter to decoded_content.
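A minimal sketch of what forcing the charset might look like. It builds an HTTP::Response by hand so it runs without network access; the gb2312 label, the GBK payload and the sample characters are assumptions made up for the illustration, not taken from any of the sites above.

```perl
use strict;
use warnings;

use HTTP::Response;
use Encode qw/ encode /;

# Assumed scenario: the header says gb2312, but the body is really GBK.
# U+4E02 exists in GBK and not in GB2312, so the label is too narrow.
my $bytes    = encode( 'gbk', "\x{4F60}\x{597D}\x{4E02}" );
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=gb2312' ],
    $bytes,
);

# The charset option overrides whatever the Content-Type header declares.
my $forced = $response->decoded_content( charset => 'gbk', raise_error => 1 );

printf "forced gbk: %d characters\n", length $forced;
```

With a real page you would of course have to know (or guess) the right charset first; the option only tells decoded_content to stop trusting the header.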

Ciao, Valerio


Re^2: Chinese site and decoded_content() trouble
by isync (Hermit) on Jun 11, 2007 at 08:50 UTC
    Thanks for your comment! And thank you for pointing out the raise_error switch - which I wasn't aware of (rtfm...)!

    Running your test script gives me these errors:
        test: http://cn.life.dada.net/people/
        is decoded content utf8? 1

        test: http://www.sina.com.cn/
        decode failed: euc-cn "\x82" does not map to Unicode at /usr/lib/perl/5.8/Encode.pm line 166.
         at test_ku6.pl line 22

        test: http://www.ku6.com/show/34D6sgY4X6w3YegR.html
        decode failed: utf8 "\xE6" does not map to Unicode at /usr/lib/perl/5.8/Encode.pm line 166.
         at test_ku6.pl line 22

        test: http://www.xinhua.cn/
        decode failed: euc-cn "\xA9" does not map to Unicode at /usr/lib/perl/5.8/Encode.pm line 166.
         at test_ku6.pl line 22
    which suggests that decoded_content is using the wrong charset for decoding (right?).
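To convince myself, the same kind of error can be reproduced with Encode alone. This is only a sketch under an assumption: that the pages are labelled gb2312 (which Encode, as far as I know, treats as an alias for euc-cn) while actually containing GBK bytes. The two sample bytes are made up for the test; 0x82 is not a legal lead byte in EUC-CN, but "\x82\x40" is a valid GBK pair.

```perl
use strict;
use warnings;

use Encode qw/ decode FB_CROAK LEAVE_SRC /;

# Hypothetical sample: one GBK character that has no EUC-CN equivalent.
my $bytes = "\x82\x40";

# Strict decoding with the (assumed) declared charset dies ...
my $as_euc_cn = eval { decode( 'euc-cn', $bytes, FB_CROAK | LEAVE_SRC ) };
my $euc_error = $@;
print "euc-cn failed: $euc_error" if $euc_error;

# ... while the wider GBK charset accepts the same bytes.
my $as_gbk = eval { decode( 'gbk', $bytes, FB_CROAK | LEAVE_SRC ) };
printf "gbk: decoded %d character(s)\n", length $as_gbk if defined $as_gbk;
```

LEAVE_SRC is there because decode may otherwise modify $bytes in place when a check value is given.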

    A search on the issue led me to this helpful thread http://www.issociate.de/board/post/400895/Fetching_the_charset_when_set_in_meta..html and to the non-CPAN module HTTP::Response::Charset (which I haven't tried so far).

    So I am stuck here with three options:
    1. use Encode::Guess
    2. use HTML::Encoding
    3. use HTTP::Response::Charset
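For reference, option 1 might look roughly like the sketch below. The sample bytes and the choice of euc-cn as the only extra suspect are assumptions for the illustration; real pages would need a longer suspect list, and guess_encoding returns an error string instead of an object when it cannot decide.

```perl
use strict;
use warnings;

use Encode qw/ encode /;
use Encode::Guess;    # exports guess_encoding

# Hypothetical sample bytes: U+4F60 U+597D ("ni hao") encoded as EUC-CN.
my $bytes = encode( 'euc-cn', "\x{4F60}\x{597D}" );

# ascii and utf8 are checked by default; euc-cn is added as a suspect.
my $guess = guess_encoding( $bytes, 'euc-cn' );

my $text;
if ( ref $guess ) {
    # An encoding object came back, so use it to decode.
    $text = $guess->decode($bytes);
    printf "guessed %s, %d characters\n", $guess->name, length $text;
}
else {
    # Otherwise $guess is a plain error message.
    print "could not guess: $guess\n";
}
```

The weak spot is ambiguity: when several suspects can decode the same bytes, the result is an error string listing the candidates rather than a decoder.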

    A quick test seems to show that HTML::Encoding is the most reliable (and much less of a hassle to install than Encode::Detect). Now I will try
        my $enco = encoding_from_http_message($resp);
        my $utf8 = decode($enco => $resp->content);
    But how do I combine it elegantly with LWP? (I am not a great module guru..)
    Is there a way to pull it into the LWP context and get, let's say, $response->as_utf8 (which is the result of the above)?
