
Why is utf8 flag set after Encode::decode of pure ASCII?

by brycen (Monk)
on Mar 29, 2010 at 16:58 UTC ( #831664=perlquestion )
brycen has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, my gods are in disagreement; whom can I trust? The Encode page says:
    After $utf8 = decode('foo', $octet);

        When $octet is...               The UTF8 flag in $utf8 is
        ---------------------------------------------------------
        In ASCII only (or EBCDIC only)  OFF
        In ISO-8859-1                   ON
        In any other Encoding           ON
        ---------------------------------------------------------
Yet at the altar of Perl 5.10.0 and Perl 5.8:
    #use utf8;
    #use encoding 'iso-8859-1';
    #use encoding 'utf-8-strict';
    #use encoding::warnings;   # check for implicit upgrade
    use Encode;

    print '${^UNICODE}=' . ${^UNICODE} . "\n\n";

    $a = "face";
    $b = "not_a_face=\x{e2}\x{98}\x{ba}";
    print "a=", Encode::is_utf8($a), " b=", Encode::is_utf8($b), "\n\n";

    $a = Encode::decode('iso-8859-1', $a);
    $b = Encode::decode('iso-8859-1', $b);
    print "a=", Encode::is_utf8($a), " b=", Encode::is_utf8($b), "\n\n";
    ${^UNICODE}=63

    a= b=

    a=1 b=1
This shows that a pure-ASCII string ends up with the utf8 flag set. My faith in Perl Unicode is tested. Monks, can you help me see the light?

Bryce Nesbitt, Berkeley Electronic Press, Berkeley CA

Replies are listed 'Best First'.
Re: Why is utf8 flag set after Encode::decode of pure ASCII?
by ikegami (Pope) on Mar 29, 2010 at 17:14 UTC

    The documentation is incorrect, although the only difference is a potential performance decline. Feel free to file a bug report.

    Update: When I said the documentation is incorrect, I just meant the documentation and the code aren't in sync. I didn't mean to voice an opinion on whether the code or the documentation needs to be changed. I'm favouring a code change.

Re: Why is utf8 flag set after Encode::decode of pure ASCII?
by moritz (Cardinal) on Mar 29, 2010 at 18:34 UTC
    The UTF-8 flag is meant to be purely internal - asking about its value in your Perl code and making decisions based on that information is bound to get you in trouble.

    Still if the documentation says it should be off, it should probably be updated.

    Perl 6 - links to (nearly) everything that is Perl 6.
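        To illustrate moritz's point, here is a minimal sketch: two strings that differ only in the internal flag are indistinguishable at the language level, which is why branching on is_utf8() in application code is usually a mistake.

        ```perl
        use strict;
        use warnings;
        use Encode ();

        # Two logically identical ASCII strings: one plain, one passed
        # through decode(), which may turn the internal UTF8 flag on
        # (depending on the Perl/Encode version, per this thread).
        my $plain   = "face";
        my $decoded = Encode::decode('iso-8859-1', "face");

        # The flag may differ...
        printf "plain flag=%d decoded flag=%d\n",
            Encode::is_utf8($plain)   ? 1 : 0,
            Encode::is_utf8($decoded) ? 1 : 0;

        # ...but at the language level the strings are indistinguishable:
        print "equal\n" if $plain eq $decoded;
        print "same length\n" if length($plain) == length($decoded);
        ```

        Whatever the flag says, eq, length, and the regex engine operate on characters, so both strings behave identically.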
      Many people give that advice: ignore Perl's internal encoding. It's fine advice, at least in production. But Perl performs so much magic behind-the-scenes Unicode conversion that during development one often needs to look at this flag just to understand which end is up. Grr.

        It's not just Perl -- it's also CPAN modules, particularly XS modules. And I agree -- it's foolhardy to pretend that the SVf_UTF8 flag doesn't exist. It's almost impossible to troubleshoot UTF-8 problems in a large system without snooping it. The system is prone to silent failure, and when something goes wrong and you need to track down where the silent failure originates, you need to look at that flag.

        (Die $YAML::Syck::ImplicitUnicode, die die die.)
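        For that kind of snooping, the core Devel::Peek module dumps the SVf_UTF8 flag and the raw internal bytes directly; a minimal sketch:

        ```perl
        use strict;
        use warnings;
        use Devel::Peek ();
        use Encode ();

        # One high byte, typically stored without the UTF8 flag...
        my $raw     = "caf\x{e9}";
        # ...and the same characters after decode(), with the flag on.
        my $decoded = Encode::decode('iso-8859-1', $raw);

        # Dump() prints the SV's FLAGS and PV to STDERR, so you can see
        # both the flag state and the internal byte representation.
        Devel::Peek::Dump($raw);
        Devel::Peek::Dump($decoded);
        ```

        Comparing the two dumps shows the same characters stored two different ways, which is usually the fastest route to the origin of a silent failure.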

        Agreed, but it's the only sane advice you will get.

        The reason you sometimes have to look at the utf8 flag is that some code (mostly CPAN modules) does not follow that sane advice.

        If you want to read/write text in a portable manner, or convert between text and binary (integer) representations of characters, you have to specify what encoding you're expecting. If you don't, your code will only work reliably on 7-bit ASCII text, and even then only on most platforms. That's the executive summary, and that's really all there is to it.
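        Concretely, specifying the encoding usually means an :encoding() I/O layer, so Perl decodes on read and encodes on write. A minimal sketch (the file path is just illustrative):

        ```perl
        use strict;
        use warnings;

        my $path = '/tmp/out.txt';   # illustrative path

        # Encode on write: the e-acute character becomes two UTF-8 bytes on disk.
        open my $out, '>:encoding(UTF-8)', $path or die "open: $!";
        print {$out} "caf\x{e9}\n";
        close $out;

        # Decode on read: the two bytes come back as one character.
        open my $in, '<:encoding(UTF-8)', $path or die "open: $!";
        my $line = <$in>;
        close $in;

        print length($line), "\n";   # 5 characters (c a f e-acute newline), not 6 bytes
        ```

        With the layers in place, the rest of the program deals only in characters and never has to care about the flag.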

Re: Why is utf8 flag set after Encode::decode of pure ASCII?
by creamygoodness (Curate) on Mar 29, 2010 at 19:02 UTC

    ASCII strings may follow different paths through the code depending on whether the SVf_UTF8 flag is set, but the end results should be exactly the same. That makes it hard to maintain discipline as to whether the flag should be on or off, and in practice, you can't count on it being one way or the other.

    If you have an all-Unicode application or subsystem, sometimes it makes sense to convert the string to the internal UTF8 representation at the boundary as it enters the subsystem, so that you don't have to continually run UTF-8 byte sequence validity checks to see whether the scalar is pure ASCII or contains code points above 127. The easy way to do this is to turn the SVf_UTF8 flag on even if it's an ASCII string. One of my XS distros does this.
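    A minimal sketch of that boundary normalization using the core utf8::upgrade function (the sub name at_boundary is just illustrative):

    ```perl
    use strict;
    use warnings;

    # Normalize at the subsystem boundary: force the internal UTF-8
    # representation even for pure-ASCII input, so everything inside
    # the boundary can assume a single representation.
    sub at_boundary {
        my ($text) = @_;
        utf8::upgrade($text);   # sets SVf_UTF8; never changes the characters
        return $text;
    }

    my $s = at_boundary("plain ascii");
    print utf8::is_utf8($s) ? "flag on\n" : "flag off\n";   # flag on
    print $s, "\n";                                          # characters unchanged
    ```

    utf8::upgrade only changes the internal representation, never the string's characters, so this is safe to apply unconditionally at the edge.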

Re: Why is utf8 flag set after Encode::decode of pure ASCII?
by brycen (Monk) on Mar 30, 2010 at 01:18 UTC
Re: Why is utf8 flag set after Encode::decode of pure ASCII?
by mrajcok (Initiate) on Mar 30, 2010 at 19:26 UTC

Node Type: perlquestion [id://831664]
Front-paged by keszler