PerlMonks  

Re: Explicit charset confuses WWW::Mechanize and/or HTTP::Response

by vitoco (Hermit)
on May 15, 2009 at 15:52 UTC ( [id://764296] )


in reply to Explicit charset confuses WWW::Mechanize and/or HTTP::Response

Finally, I determined that the HTML page is being sent by the webserver using iso-8859-1 (latin-1), not utf-8.

The problem is that the <meta> tag lies about the page's encoding (it says utf-8). I think HTTP::Response decodes the content based on that tag, so WWW::Mechanize receives corrupted data and then saves it with a "Wide character" warning.

Since I cannot change anything on the remote server, how can I handle this? Is there a way to stop the automagic decoding done by the modules and process the data myself?
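
One possible approach, sketched here under some assumptions: take the raw octets from the response and decode them yourself with the charset you know to be correct. The example below builds a hypothetical HTTP::Response by hand (the body, the header values and the variable names are made up for illustration), so it runs without touching the network. It relies on `content` returning the undecoded bytes and on `decoded_content` accepting a `charset` option that overrides whatever the response claims.

```perl
use strict;
use warnings;
use HTTP::Headers;
use HTTP::Response;
use Encode qw(decode);

# Hypothetical response: the body bytes are Latin-1 (\xE9 is "e acute"),
# but the <meta> tag claims the page is utf-8.
my $body = qq{<html><head>}
         . qq{<meta http-equiv="Content-Type" content="text/html; charset=utf-8">}
         . qq{</head><body>caf\xE9</body></html>};

my $headers = HTTP::Headers->new('Content-Type' => 'text/html');
my $res     = HTTP::Response->new(200, 'OK', $headers, $body);

# content() returns the octets exactly as stored; no charset decoding is applied.
my $raw = $res->content;

# Option 1: decode by hand with the charset we trust.
my $text = decode('iso-8859-1', $raw);

# Option 2: let decoded_content() do it, but force the charset explicitly.
my $forced = $res->decoded_content(charset => 'iso-8859-1');

# For comparison: with no options the charset is guessed, and the guess
# may be misled by the <meta> tag, depending on the libwww-perl version.
my $guessed = $res->decoded_content;

print "manual and forced decoding agree\n" if $text eq $forced;
```

With WWW::Mechanize, `$mech->response` hands back the underlying HTTP::Response, so the same two calls should apply there. When saving decoded text, opening the output file with an `:encoding(...)` layer should avoid the "Wide character" warning; alternatively, write `$raw` to a handle in `:raw` mode and leave the bytes untouched.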

Thanks...

Replies are listed 'Best First'.
Re^2: Explicit charset confuses WWW::Mechanize and/or HTTP::Response
by Polyglot (Chaplain) on May 16, 2009 at 14:49 UTC
    I am curious about something. Have you tried more than one browser? If your situation is what it seems to me, you may see proper character handling in Firefox while IE fails. (I've been there, done that.) Firefox reads the HTML headers (the <meta> tags) and respects them. IE does not: it takes the charset from the content headers in the initial output sent to the browser. Therefore, you need to do this in your code, before printing anything else to the browser:

    print "Content-type: text/html; charset=utf-8\n\n"; #print CGI::header();

    In other words, the charset must be declared as utf-8 from the very first output sent to the browser, and from then on.

    Blessings,

    Polyglot

      Polyglot: in this case IE displays the page correctly, because the correct encoding is the one the HTTP header states, as you predicted (the liar is the HTML header). I have not tried Firefox yet, but Opera also renders those pages well. I'm not sure whether those browsers do some automatic detection beyond what the HTTP and HTML headers say. Unfortunately, I can't touch the remote server's code; I just have to live with it...
