PerlMonks  

Malformed UTF-8 character

by Steve_BZ (Hermit)
on Apr 29, 2011 at 19:53 UTC (#902060=perlquestion)
Steve_BZ has asked for the wisdom of the Perl Monks concerning the following question:

Hi Guys,

I seem to have a strange encoding bug when I use degrees C written like this: °C.

I have a function call t("°C") which gives me an interpreter error:

Malformed UTF-8 character (unexpected continuation byte 0xb0, with no preceding start byte)

with double quotes. When I use single quotes, I get no interpreter error, but the function fails at the first database call. If I step through the code, the code hangs. If I execute it, I get a database error.

I'm using the statements:

    use strict;
    use warnings;
    use utf8;
    use Encode;
    binmode STDOUT, ":utf8";
    use open ':encoding(utf8)';

Any insight gratefully received.

Regards

Steve

Re: Malformed UTF-8 character
by moritz (Cardinal) on Apr 29, 2011 at 20:05 UTC

    Is your file stored in UTF-8?

    0xb0 is the representation of the degree symbol in Latin-1, but you told perl that your source file is UTF-8. Converting it to actually be UTF-8 should help.

      Thanks for this. What do you mean by actually converting it? And how can I tell whether it's in UTF-8 or not? Does

          use utf8;
          binmode STDOUT, ":utf8";
          use open ':encoding(utf8)';

      not do it?

      Do you mean what eff_i_g says, and put that extra byte (0xC2) in so the function would read t("°C")?

      Sorry, I'm away from my PC at the moment or I'd test it myself.

      Regards

      Steve.

        It seems you don't really understand character encodings. Try reading this article to get the basics.

        The line use utf8; tells Perl that your script is stored in UTF-8, but it is not. Your editor did not save it as UTF-8, but rather as another encoding, likely Latin-1.

        So either don't tell Perl that the file is stored in UTF-8 when it is not, or do store the file in UTF-8 (and use an editor which properly supports UTF-8).
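        As for how to tell whether a file is really UTF-8, here is a minimal sketch using the core Encode module. The helper names is_valid_utf8 and latin1_to_utf8 are made up for illustration, and the conversion assumes the original bytes really are Latin-1; in practice you would read the bytes from your script file in raw mode rather than from a literal.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC stops
    # decode() from modifying $bytes in place.
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Hypothetical helper: treat the bytes as Latin-1 and re-encode as UTF-8.
sub latin1_to_utf8 {
    my ($bytes) = @_;
    return encode('UTF-8', decode('ISO-8859-1', $bytes));
}

my $latin1 = "\x{B0}C";    # degree sign + "C" as a Latin-1 editor saves it
my $utf8   = latin1_to_utf8($latin1);

printf "Latin-1 bytes valid as UTF-8? %s\n",
    is_valid_utf8($latin1) ? "yes" : "no";                 # no
printf "converted bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $utf8; # C2 B0 43
```

        A file that passes the check can safely be loaded under use utf8; one that fails needs to be re-saved or converted first.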

Re: Malformed UTF-8 character
by eff_i_g (Curate) on Apr 29, 2011 at 20:07 UTC
    0xB0 by itself is invalid UTF-8; the proper encoding of "°" is 0xC2 0xB0. The file was therefore not written as UTF-8, hence the error when Perl tries to read it as such.
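    The byte values are easy to check with the core Encode module. This is only a small sketch; hex_bytes is a hypothetical helper written for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of a byte string, e.g. "C2 B0".
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $degree = "\x{B0}";    # U+00B0 DEGREE SIGN as a single character

print "Latin-1: ", hex_bytes(encode('ISO-8859-1', $degree)), "\n";    # B0
print "UTF-8:   ", hex_bytes(encode('UTF-8',      $degree)), "\n";    # C2 B0
```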

      Thanks for your help. Both solutions described here worked: the "°" and changing the file-encoding setting, which I guess achieved the same thing.

      Regards

      Steve

Re: Malformed UTF-8 character
by Eliya (Vicar) on Apr 29, 2011 at 20:16 UTC

    Others have explained why Perl complains, at least when the string literal is declared with double quotes.  In other words, your source is apparently not encoded in UTF-8, as you're telling Perl with the pragma use utf8.

    What I find more surprising is that Perl doesn't complain when, within the scope of use utf8, the string literal (containing a Latin-1 encoded char such as the byte 0xB0) is declared using single quotes.  I'd say the latter is a bug (unless I've overlooked something in the docs... :)

    (I can replicate the issue here with 5.12.2.)

      Thanks for this. Do I have the use statements right?

          use utf8;
          use Encode;
          binmode STDOUT, ":utf8";
          use open ':encoding(utf8)';

      I'm not quite sure what they do or how they differ. I understand that use utf8 is to save the code page in utf-8, that use Encode is to provide a utility to encode and decode, but I'm not sure what binmode STDOUT, ":utf8"; or use open ':encoding(utf8)'; do.

      Regards

      Steve

        I understand that use utf8 is to save the code page in utf-8

        Not really sure what you mean by that... but use utf8 tells Perl that the source code (string literals, etc.) is encoded in UTF-8. So you shouldn't use it if that's not the case for your script.

        binmode STDOUT, ":utf8" sets the utf8 PerlIO layer for STDOUT, which tells Perl that you want UTF-8 encoded output for that file handle.

        use open ':encoding(utf8)' declares the default layer for I/O streams, i.e. you don't have to explicitly specify the respective layer when you open a file.  See the open pragma for the details.
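        To see what an encoding layer does, here is a minimal sketch that writes a character string through an explicit :encoding(UTF-8) layer into an in-memory buffer and reads it back. The in-memory filehandle is only there to keep the example self-contained; a binmode call or the open pragma applies the same kind of layer to STDOUT or to files.

```perl
use strict;
use warnings;

my $text = "\x{B0}C";    # character string: DEGREE SIGN followed by "C"

# Write through a UTF-8 encoding layer into an in-memory buffer.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
print {$out} $text;
close $out;
printf "bytes written: %d\n", length $buffer;    # 3 (C2 B0 43)

# Read the same bytes back through the layer to get characters again.
open my $in, '<:encoding(UTF-8)', \$buffer or die "open failed: $!";
my $read = <$in>;
close $in;
printf "characters read: %d\n", length $read;    # 2
```

        The layer converts between Perl's internal characters and the bytes on the handle, so the two-character string becomes three bytes on output and two characters again on input.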

      Eliya wrote:
      What I find more surprising is that Perl doesn't complain when, within the scope of use utf8, the string literal (containing a Latin-1 encoded char such as the byte 0xB0) is declared using single quotes. I'd say the latter is a bug (unless I've overlooked something in the docs... :)

      I can confirm it still occurs in 5.14 RC0:

      % blead -C0 -le 'print qq(print "\xB0C";)' | blead -Mutf8 -CS -l
      Malformed UTF-8 character (unexpected continuation byte 0xb0, with no preceding start byte) at - line 1.
      C
      % blead -C0 -le 'print qq(print \x27\xB0C\x27;)' | blead -Mutf8 -CS -l
      #C
      Oops.

Node Type: perlquestion [id://902060]
Approved by Corion
Front-paged by Corion