Malformed UTF-8 character

Steve_BZ has asked for the wisdom of the Perl Monks concerning the following question:

Hi Guys,

I seem to have a strange encoding bug when I use degrees C written like this: °C.

I have a function t("°C") which gives me an interpreter error:

Malformed UTF-8 character (unexpected continuation byte 0xb0, with no preceding start byte)

with double quotes. When I use single quotes, I get no interpreter error, but the function fails at the first database call. If I step through the code, the code hangs. If I execute it, I get a database error.

I'm using the statements:

use strict;
use warnings;
use utf8;
use Encode;
binmode STDOUT, ":utf8";
use open ':encoding(utf8)';

Any insight gratefully received.

Regards

Steve


Replies are listed 'Best First'.
Re: Malformed UTF-8 character by moritz (Cardinal) on Apr 29, 2011 at 20:05 UTC
Is your file stored in UTF-8? 0xb0 is the representation of the degree symbol in Latin-1, but you told perl that your source file is UTF-8. Converting it to actually be UTF-8 should help. Perl 6 - second systems done right
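A minimal sketch of how one might check whether a run of bytes is valid UTF-8, using the core Encode module. The helper name `is_valid_utf8` is made up for illustration; in practice the bytes would come from reading the script through a `:raw` layer rather than from literals.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: true if the byte string decodes as strict UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;    # work on a copy; decode with FB_CROAK may modify its argument
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

my $latin1_bytes = "\xB0C";        # degree sign as a single Latin-1 byte, then "C"
my $utf8_bytes   = "\xC2\xB0C";    # degree sign as the two-byte UTF-8 sequence, then "C"

print "Latin-1 bytes: ", (is_valid_utf8($latin1_bytes) ? "valid" : "not valid"), " UTF-8\n";
print "UTF-8 bytes:   ", (is_valid_utf8($utf8_bytes)   ? "valid" : "not valid"), " UTF-8\n";
```

A file that fails this check but was saved by a Western-European editor is quite possibly Latin-1 or Windows-1252, though the check alone cannot say which.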
Re^2: Malformed UTF-8 character by Steve_BZ (Chaplain) on Apr 30, 2011 at 13:28 UTC
Thanks for this. What do you mean by actually converting it? And how can I tell whether it's in UTF-8 or not? Does `use utf8; binmode STDOUT, ":utf8"; use open ':encoding(utf8)';` not do it? Do you mean what eff_i_g says, putting that extra character (Â) in so the function would read t("Â°C")? Sorry, I'm away from my PC at the moment or I'd test it myself. Regards Steve.
Re^3: Malformed UTF-8 character by moritz (Cardinal) on Apr 30, 2011 at 13:51 UTC
It seems you don't really understand character encodings. Try reading this article to get the basics. The line `use utf8;` tells Perl that your script is stored in UTF-8, but it is not. Your editor did not save it as UTF-8, but rather in another encoding, likely Latin-1. So either don't tell Perl that the file is stored in UTF-8 when it is not, or do store the file in UTF-8 (and use an editor which properly supports UTF-8).
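As a rough illustration of the second option (actually storing the file as UTF-8), the following sketch re-encodes a Latin-1 file through PerlIO encoding layers. It assumes the original really is Latin-1; the helper `convert_latin1_file` and the temporary file names are hypothetical, and most editors offer the same conversion through a save-as or file-encoding setting.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: copy $in to $out, re-encoding Latin-1 as UTF-8.
sub convert_latin1_file {
    my ($in, $out) = @_;
    open my $src, '<:encoding(ISO-8859-1)', $in  or die "Cannot read $in: $!";
    open my $dst, '>:encoding(UTF-8)',      $out or die "Cannot write $out: $!";
    print {$dst} $_ while <$src>;
    close $src;
    close $dst or die "Cannot close $out: $!";
    return;
}

# Demo on a throwaway file holding the call from the question.
my $dir = tempdir(CLEANUP => 1);
my ($old, $new) = ("$dir/latin1.pl", "$dir/utf8.pl");
open my $fh, '>:raw', $old or die "Cannot write $old: $!";
print {$fh} qq{t("\xB0C");\n};    # 0xB0 is the Latin-1 degree sign
close $fh or die "Cannot close $old: $!";

convert_latin1_file($old, $new);

# Read the result back as raw bytes and look for the UTF-8 degree sign.
open my $check, '<:raw', $new or die "Cannot read $new: $!";
my $bytes = do { local $/; <$check> };
close $check;
print $bytes =~ /\xC2\xB0/ ? "degree sign is now C2 B0\n" : "unexpected bytes\n";
```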
Re^4: Malformed UTF-8 character by Steve_BZ (Chaplain) on Apr 30, 2011 at 19:59 UTC
Re: Malformed UTF-8 character by eff_i_g (Curate) on Apr 29, 2011 at 20:07 UTC
0xB0 by itself ("°" in Latin-1) is invalid UTF-8; the proper UTF-8 encoding of the degree sign is 0xC2 0xB0 (which looks like "Â°" when viewed as Latin-1). Therefore, the file was not actually written as UTF-8, hence the error when Perl tries to read it as such.
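The byte values above can be reproduced with the core Encode module. This sketch also shows why the UTF-8 form appears as "Â°" in a Latin-1 view: read as Latin-1, the two bytes are the characters U+00C2 and U+00B0. Variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $degree = "\x{B0}";                    # the character U+00B0 DEGREE SIGN
my $utf8   = encode('UTF-8', $degree);    # its UTF-8 byte sequence

# Hex dump of the UTF-8 bytes.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $utf8;
print "$hex\n";                           # prints: C2 B0

# Misreading those two bytes as Latin-1 yields two separate characters.
my $misread    = decode('ISO-8859-1', $utf8);
my $codepoints = join ' ', map { sprintf 'U+%04X', ord } split //, $misread;
print "$codepoints\n";                    # prints: U+00C2 U+00B0
```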
Re^2: Malformed UTF-8 character by Steve_BZ (Chaplain) on Apr 30, 2011 at 20:02 UTC
Thanks for your help. Both of the solutions described here worked: the "Â°" and changing the file-encoding setting, which I guess achieved the same thing. Regards Steve
Re: Malformed UTF-8 character by Eliya (Vicar) on Apr 29, 2011 at 20:16 UTC
Others have explained why Perl complains, at least in the case where the string literal is declared with double quotes. In other words, your source is apparently not encoded in UTF-8, contrary to what you're telling Perl with the pragma `use utf8`. What I find more surprising is that Perl doesn't complain when, within the scope of `use utf8`, the string literal (containing a Latin-1 encoded char like '°') is declared using single quotes. I'd say the latter is a bug (unless I've overlooked something in the docs... :) (I can replicate the issue here with 5.12.2.)
Re^2: Malformed UTF-8 character by tchrist (Pilgrim) on Apr 30, 2011 at 16:31 UTC
Eliya wrote: "What I find more surprising is that Perl doesn't complain when, within the scope of `use utf8`, the string literal (containing a Latin-1 encoded char like '°') is declared using single quotes. I'd say the latter is a bug (unless I've overlooked something in the docs... :)"
I can confirm it still occurs in 5.14 RC0:
% blead -C0 -le 'print qq(print "\xB0C";)' | blead -Mutf8 -CS -l
Malformed UTF-8 character (unexpected continuation byte 0xb0, with no preceding start byte) at - line 1.
C
% blead -C0 -le 'print qq(print \x27\xB0C\x27;)' | blead -Mutf8 -CS -l
#C
Oops.
Re^2: Malformed UTF-8 character by Steve_BZ (Chaplain) on Apr 30, 2011 at 13:23 UTC
Thanks for this. Do I have the use statements right? `use utf8; use Encode; binmode STDOUT, ":utf8"; use open ':encoding(utf8)';` I'm not quite sure what they do or what the difference is. I understand that `use utf8` is to save the code page in utf-8, that `use Encode` is to provide a utility to encode and decode, but I'm not sure what `binmode STDOUT, ":utf8";` or `use open ':encoding(utf8)';` do. Regards Steve
Re^3: Malformed UTF-8 character by Eliya (Vicar) on Apr 30, 2011 at 13:45 UTC
"I understand that `use utf8` is to save the code page in utf-8"
Not really sure what you mean by that... but `use utf8` tells Perl that the source code (string literals, etc.) is encoded in UTF-8. So you shouldn't use it if that's not the case for your script. `binmode STDOUT, ":utf8"` sets the utf8 PerlIO layer for STDOUT, which tells Perl that you want UTF-8 encoded output for that file handle. `use open ':encoding(utf8)'` declares the default layer for I/O streams, i.e. you don't have to explicitly specify the respective layer when you open a file. See the open pragma for the details.
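To make the effect of an output layer concrete, here is a small sketch that prints through an encoding layer into an in-memory buffer instead of STDOUT (an assumption made purely so the result can be inspected). It uses `:encoding(UTF-8)`, the strict counterpart of the `:utf8` layer mentioned above; variable names are illustrative.

```perl
use strict;
use warnings;

my $degree_c = "\x{B0}C";    # two characters: U+00B0 and "C"

# Write the string through a UTF-8 encoding layer into an in-memory buffer.
my $buffer = '';
open my $out, '>', \$buffer or die "Cannot open in-memory handle: $!";
binmode $out, ':encoding(UTF-8)';
print {$out} $degree_c;
close $out or die "Cannot close in-memory handle: $!";

# The layer turned two characters into three bytes (C2 B0 43).
print length($degree_c), " characters, ", length($buffer), " bytes\n";
```

The same string printed to a handle without such a layer would go out as the single byte 0xB0, which is where a terminal or database expecting UTF-8 starts to complain.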
Re^4: Malformed UTF-8 character by Steve_BZ (Chaplain) on Apr 30, 2011 at 20:03 UTC

