Why is utf8 flag set after Encode::decode of pure ASCII?

brycen has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, My gods are in disagreement, who can I trust? The Encode page http://perldoc.perl.org/Encode.html#The-UTF8-flag says:

After $utf8 = decode('foo', $octet); ,

  When $octet is...   The UTF8 flag in $utf8 is
  ---------------------------------------------
  In ASCII only (or EBCDIC only)            OFF
  In ISO-8859-1                              ON
  In any other Encoding                      ON
  ---------------------------------------------

Yet at the altar of Perl 5.10.0 and Perl 5.8:

#use utf8;
#use encoding 'iso-8859-1';
#use encoding 'utf-8-strict';
#use encoding::warnings;  #check for implicit upgrade
use Encode;

print '${^UNICODE}='.${^UNICODE}."\n\n";

$a = "face";
$b = "not_a_face=\x{e2}\x{98}\x{ba}";
print "a=",Encode::is_utf8($a)," b=",Encode::is_utf8($b),"\n\n";

$a=Encode::decode('iso-8859-1',$a);
$b=Encode::decode('iso-8859-1',$b);
print "a=",Encode::is_utf8($a)," b=",Encode::is_utf8($b),"\n\n";

Produces:

${^UNICODE}=63
a= b=
a=1 b=1

Showing that a pure ASCII string has the utf8 flag set. My faith in Perl Unicode is tested. Monks can you help me see the light?

Bryce Nesbitt, Berkeley Electronic Press, Berkeley CA


Re: Why is utf8 flag set after Encode::decode of pure ASCII? by ikegami (Patriarch) on Mar 29, 2010 at 17:14 UTC
The documentation is incorrect, although the only difference is a potential performance decline. Feel free to file a bug report. Update: When I said the documentation is incorrect, I just meant the documentation and the code aren't in sync. I didn't mean to voice an opinion on whether the code or the documentation needs to be changed. I'm favouring a code change.
Re: Why is utf8 flag set after Encode::decode of pure ASCII? by moritz (Cardinal) on Mar 29, 2010 at 18:34 UTC
The UTF-8 flag is meant to be purely internal. Asking about its value in your Perl code and making decisions based on that information is bound to get you in trouble. Still, if the documentation says it should be off, it should probably be updated.
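A minimal sketch of why the flag is treated as internal (not part of the original reply): the same four characters are stored once with the flag off and once with it on, using the core `utf8::upgrade` and `utf8::is_utf8` functions. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $plain    = "face";      # ASCII literal, internal flag off
my $upgraded = "face";
utf8::upgrade($upgraded);   # same four characters, internal flag now on

# Only the internal storage differs; string semantics are identical.
my %seen = ($plain => 1);
printf "flags:  plain=%d upgraded=%d\n",
    (utf8::is_utf8($plain)    ? 1 : 0),
    (utf8::is_utf8($upgraded) ? 1 : 0);
printf "equal:  %s\n", ($plain eq $upgraded ? "yes" : "no");
printf "length: %d vs %d\n", length($plain), length($upgraded);
printf "hash:   %s\n", (exists $seen{$upgraded} ? "same key" : "different key");
```

Code that branches on the flag would treat these two scalars differently even though every string operation sees them as the same value.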
Re^2: Why is utf8 flag set after Encode::decode of pure ASCII? by brycen (Monk) on Mar 30, 2010 at 18:32 UTC
Many people give such advice: ignore Perl's internal encoding. Fine advice, at least in production. But Perl makes so many magic behind-the-scenes Unicode conversions that one often needs to look at this flag during development in order to understand which end is up. Grr.
Re^3: Why is utf8 flag set after Encode::decode of pure ASCII? by creamygoodness (Curate) on Mar 30, 2010 at 19:00 UTC
It's not just Perl -- it's also CPAN modules, particularly XS modules. And I agree -- it's foolhardy to pretend that the `SVf_UTF8` flag doesn't exist. It's almost impossible to troubleshoot UTF-8 problems in a large system without snooping it. The system is prone to silent failure, and when something goes wrong and you need to track down where the silent failure originates, you need to look at that flag. (Die `$YAML::Syck::ImplicitUnicode`, die die die.)
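For that kind of snooping, a small debugging helper can report the flag alongside the characters Perl actually sees. This is only a sketch; `describe` is a hypothetical helper name, not something from Encode or the thread.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical debugging helper: reports the internal flag together with
# the length and code points, which are what string operations work on.
sub describe {
    my ($label, $str) = @_;
    return sprintf "%s: flag %s, length %d, code points [%s]",
        $label,
        (utf8::is_utf8($str) ? "on" : "off"),
        length($str),
        join(" ", map { sprintf "U+%04X", ord($_) } split //, $str);
}

my $octets = "\xe2\x98\xba";             # three raw bytes (UTF-8 for U+263A)
my $text   = decode('UTF-8', $octets);   # one decoded character

print describe("octets", $octets), "\n";
print describe("text",   $text),   "\n";
```

Comparing the length and code points before and after each step usually shows where a string was decoded twice or never decoded at all. The core module Devel::Peek offers a lower-level view of the same information.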
Re^3: Why is utf8 flag set after Encode::decode of pure ASCII? by Joost (Canon) on Apr 01, 2010 at 00:36 UTC
Agreed, but it's the only sane advice you will get. The reason you sometimes have to look at the utf8 flag is that some code (mostly CPAN modules) does not follow that sane advice. If you want to read/write text in a portable manner, or convert between text and the binary (integer) representation of characters, you have to specify what encoding you're expecting. If you don't, your code will only work reliably on 7-bit ASCII text. And even that will only work on most platforms. That's the executive summary, and that's really all there is to it.
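A short sketch of that advice, assuming the input is known to be UTF-8 (the sample data and variable names are made up for illustration): decode octets explicitly on the way in, work on characters, and encode explicitly on the way out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets_in  = "caf\xC3\xA9";                # UTF-8 bytes, as read from a file or socket
my $text       = decode('UTF-8', $octets_in);  # four characters: c, a, f, U+00E9
my $shouted    = uc $text;                     # text operations act on characters
my $octets_out = encode('UTF-8', $shouted);    # back to bytes for output

printf "characters in: %d, bytes out: %d\n", length($text), length($octets_out);
```

Nothing here inspects the utf8 flag. The encoding is stated at both boundaries, so the internal representation never has to be guessed.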
Re: Why is utf8 flag set after Encode::decode of pure ASCII? by creamygoodness (Curate) on Mar 29, 2010 at 19:02 UTC
ASCII strings may follow different paths through the code depending on whether the `SVf_UTF8` flag is set, but the end results should be exactly the same. That makes it hard to maintain discipline as to whether the flag should be on or off, and in practice, you can't count on it being one way or the other. If you have an all-Unicode application or subsystem, sometimes it makes sense to convert the string to an internal UTF8 representation at the boundary as it enters the subsystem, so that you don't have to continually run UTF-8 byte sequence validity checks to see whether the scalar is pure ASCII or contains high (8-bit) code points. The easy way to do this is to turn the `SVf_UTF8` flag on even if it's an ASCII string. One of my XS distros does this.
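In pure Perl, the boundary conversion described above might look roughly like this. It is a sketch under assumptions: `to_internal` is a hypothetical helper, and the XS distro mentioned in the reply may well do it differently.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical boundary helper: decode incoming octets, then make sure the
# result uses the internal UTF-8 representation even if it is pure ASCII.
sub to_internal {
    my ($octets, $encoding) = @_;
    my $text = decode($encoding, $octets);
    utf8::upgrade($text);   # no effect on the characters, only on storage
    return $text;
}

my $ascii = to_internal("face",         'iso-8859-1');
my $wide  = to_internal("fa\xE7ade",    'iso-8859-1');

printf "ascii flag: %d, wide flag: %d\n",
    (utf8::is_utf8($ascii) ? 1 : 0),
    (utf8::is_utf8($wide)  ? 1 : 0);
```

After this point every string inside the subsystem has the same internal form, so downstream code has one path to handle rather than two.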
Re: Why is utf8 flag set after Encode::decode of pure ASCII? by brycen (Monk) on Mar 30, 2010 at 01:18 UTC
Ah, that would be THIS bug: https://rt.cpan.org/Public/Bug/Display.html?id=34259... Which is as much a comment on the attention paid to detail in the documentation as anything else.
Re: Why is utf8 flag set after Encode::decode of pure ASCII? by mrajcok (Initiate) on Mar 30, 2010 at 19:26 UTC
Related thread: Behaviour of Encode::decode_utf8 on ASCII. See also: Encode vs core utf8::
