PerlMonks  

Problem with join'ing utf8 and non-utf8 strings (bug?)

by rsmah (Scribe)
on Jun 17, 2008 at 17:09 UTC ( #692551=perlquestion )

rsmah has asked for the wisdom of the Perl Monks concerning the following question:

I ran into a problem using XML::Simple generating output XML. The input hash was a mix of utf8 and non-utf8 strings. At the last stage, XML::Simple::XMLout join's components together and I get corrupted data.

I found this behavior very odd so I put together a test case that shows join corrupting a non-utf8 string when join'ed with another utf8 string.

At first I thought it might be decoding the non-utf8 string using the locale (or LANG or whatever) to some other encoding, but running this on a LANG=en_US.UTF-8 system produced the same results.

Can anyone explain to me what is going on?

Sample code:

no warnings 'utf8';
use Encode qw(decode is_utf8);

$r = "\xc2\xa9\xc2\xae\xe2\x84\xa2";
print "Raw \$r      : ", $r, " - ", (is_utf8($r)?"is":"is not"), " utf8\n";

$u = decode('utf8', "\xc2\xa9\xc2\xae\xe2\x84\xa2");
print "UTF8 \$u     : ", $u, " - ", (is_utf8($u)?"is":"is not"), " utf8\n";

$x = join('', $r, $u);
print "Join(\$r, \$u): ", $x, " - ", (is_utf8($x)?"is":"is not"), " utf8\n";

$e = decode('utf8', $r);
print "Encd \$e     : ", $e, " - ", (is_utf8($e)?"is":"is not"), " utf8\n";

$y = join('', $e, $u);
print "Join(\$e, \$u): ", $y, " - ", (is_utf8($y)?"is":"is not"), " utf8\n";

Sample Output:
Raw $r      : - is not utf8
UTF8 $u     : - is utf8
Join($r, $u): ©®� - is utf8
Encd $e     : - is utf8
Join($e, $u): - is utf8

Replies are listed 'Best First'.
Re: Problem with join'ing utf8 and non-utf8 strings (bug?)
by Juerd (Abbot) on Jun 17, 2008 at 18:23 UTC

    Hello dear Unicode newbie,

    You made one big mistake. Just one, so it's easy to fix. You assumed that you are supposed to look at the SvUTF8 flag, but you're not. It's an internal value, and because it's Perl you're allowed to look at its state. But you really shouldn't, if you want to keep your sanity.

    Don't use is_utf8, okay? If you really want to know about internal flags, please use Devel::Peek's Dump function instead. It will print some extra useful internal values too, such as the other flags in Perl like NOK and IOK. For that matter, pretend that the UTF8 flag's name is UOK.
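For anyone who does want to look at the internals, here is a minimal sketch of the Devel::Peek approach mentioned above. The variable names are only illustrative, and the exact dump format differs between Perl versions, so no output is shown.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);
use Encode qw(decode);

my $r = "\xc2\xa9";             # two undecoded bytes
my $u = decode('UTF-8', $r);    # one character, U+00A9

# Dump writes to STDERR. Compare the FLAGS line and the PV field
# of the two scalars to see how Perl stores each one internally.
Dump($r);
Dump($u);
```

This is for inspection only; it should not drive program logic.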

    Better yet, pretend that the UTF8 flag does not exist. Perl just picks an encoding for numeric and string values automatically, and only in edge cases (and if you're dealing with internals or XS) you need to know what is going on.

    Read perlunitut and perlunifaq, and realise that you sometimes may need to use Unicode::Semantics (or utf8::upgrade) before text functions operate correctly.

    I think it's best if I don't explain what goes on in your code, and if you ignore explanations by others. Trying to understand what's going on internally is a nice exercise for when you know how to write good Unicode capable code, but not before that.

    Decode your input, and encode your output. Don't query or set the SvUTF8 flag. Thanks!

    Best regards,

    Juerd
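A minimal sketch of the "decode your input, encode your output" rule, assuming the input arrives as UTF-8 bytes. The variable names are hypothetical, and the encoding is done by hand here rather than through an I/O layer.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file or socket (UTF-8 encoded).
my $input_bytes = "\xc2\xa9\xc2\xae\xe2\x84\xa2";

# Decode at the boundary: from here on the program works with characters.
my $text = decode('UTF-8', $input_bytes);

# Text operations now see 3 characters per string, not 7 bytes.
my $joined     = join('', $text, $text);
my $char_count = length($joined);

# Encode at the other boundary, just before the data leaves the program.
my $output_bytes = encode('UTF-8', $joined);

binmode(STDOUT);    # raw bytes out; the encoding was already done above
print $output_bytes, "\n";
print "characters: $char_count, bytes: ", length($output_bytes), "\n";
```

Nothing in this sketch needs to query the UTF8 flag.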

Re: Problem with join'ing utf8 and non-utf8 strings (bug?)
by ikegami (Pope) on Jun 17, 2008 at 18:09 UTC

    Two mistakes.

    • The first is that you think that $r contains 3 characters.

      $r contains 7 characters or 7 bytes.
      $u contains 3 characters.
      So $x contains 10 (7+3) characters.
      When concatenated with characters (is_utf8 == true), bytes are treated as characters.

      $e contains 3 characters.
      so $y contains 6 (3+3) characters.

    • The second is that you think you're outputting UTF-8.

      You're outputting iso-latin-1 characters since you haven't said otherwise. You happen to mix in some UTF-8, but you silenced the message warning you of this problem.

      If you want to output something other than iso-latin-1, you do so by using open (the pragma):

      use open qw( :std :locale );

    Update: Below is the fixed code (which was modified to output the length of the strings) and the output for a UTF-8 locale.

    use open qw( :std :locale );
    use Encode qw(decode is_utf8);

    $r = "\xc2\xa9\xc2\xae\xe2\x84\xa2";
    print "Raw \$r      : ", sprintf('%2d', length($r)), " ", $r, " - ", (is_utf8($r)?"is":"is not"), " utf8\n";

    $u = decode('utf8', "\xc2\xa9\xc2\xae\xe2\x84\xa2");
    print "UTF8 \$u     : ", sprintf('%2d', length($u)), " ", $u, " - ", (is_utf8($u)?"is":"is not"), " utf8\n";

    $x = join('', $r, $u);
    print "Join(\$r, \$u): ", sprintf('%2d', length($x)), " ", $x, " - ", (is_utf8($x)?"is":"is not"), " utf8\n";

    $e = decode('utf8', $r);
    print "Encd \$e     : ", sprintf('%2d', length($e)), " ", $e, " - ", (is_utf8($e)?"is":"is not"), " utf8\n";

    $y = join('', $e, $u);
    print "Join(\$e, \$u): ", sprintf('%2d', length($y)), " ", $y, " - ", (is_utf8($y)?"is":"is not"), " utf8\n";

    Raw $r      :  7 ©®„ - is not utf8
    UTF8 $u     :  3  - is utf8
    Join($r, $u): 10 ©®„ - is utf8
    Encd $e     :  3  - is utf8
    Join($e, $u):  6  - is utf8
    
Re: Problem with join'ing utf8 and non-utf8 strings (bug?)
by almut (Canon) on Jun 17, 2008 at 17:58 UTC

    I think it works as designed.  In other words, if you concat a unicode/character string with a non-unicode/byte string, the byte string will automatically be upgraded to unicode, with the non-ASCII values being interpreted as if they were in Latin-1 encoding. That is, the first byte \xc2 (Â in Latin-1) becomes Unicode Â (which happens to be codepoint U+00C2) encoded as UTF-8 (i.e. the bytes \xc3\x82), the second byte \xa9 (© in Latin-1) becomes Unicode © (codepoint U+00A9) encoded as UTF-8 (i.e. the bytes \xc2\xa9), etc.

    Update: if you print a hexdump of your string $x, e.g.

    sub hexdump {
        my $s = shift;
        print join " ", unpack("(H2)*", $s), "\n";
    }

    # ...
    hexdump($x);

    you'd get

    c3 82 c2 a9 c3 82 c2 ae c3 a2 c2 84 c2 a2 c2 a9 c2 ae e2 84 a2

    with the first 4 bytes showing the result (UTF-8 encoding) of the conversion I tried to describe above.

    Or, fully expanded:

    _________ $r (auto-upgraded) _________    ________ $u ________
    c2    a9    c2    ae    e2    84    a2    c2-a9 c2-ae e2-84-a2
    |     |     |     |     |     |     |
    |     |     |     |     |     |     |
    c3-82 c2-a9 c3-82 c2-ae c3-a2 c2-84 c2-a2 c2-a9 c2-ae e2-84-a2
    Â     ©     Â     ®     â     U0084 ¢     ©     ®     (TM)
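As far as I know, unpack has worked on characters rather than on the internal buffer since Perl 5.10, so on a newer perl the hexdump above may not show the internal UTF-8 bytes. Here is a hedged sketch that encodes explicitly before dumping; hexdump_utf8 is a hypothetical helper name, not something from the thread.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: hex-dump the UTF-8 encoding of a string's characters.
sub hexdump_utf8 {
    my ($s) = @_;
    return join ' ', unpack '(H2)*', encode('UTF-8', $s);
}

my $r = "\xc2\xa9\xc2\xae\xe2\x84\xa2";    # 7 undecoded bytes
my $u = decode('UTF-8', $r);               # 3 characters
my $x = join '', $r, $u;                   # 7 + 3 = 10 characters

# Each byte of $r is treated as a Latin-1 character, so it takes
# two bytes in UTF-8; the three characters of $u are unchanged.
my $dump = hexdump_utf8($x);
print "$dump\n";
# prints: c3 82 c2 a9 c3 82 c2 ae c3 a2 c2 84 c2 a2 c2 a9 c2 ae e2 84 a2
```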
Re: Problem with join'ing utf8 and non-utf8 strings (bug?)
by graff (Chancellor) on Jun 18, 2008 at 06:36 UTC
    You said:

    The input hash was a mix of utf8 and non-utf8 strings. At the last stage, XML::Simple::XMLout join's components together and I get corrupted data.

    Well, if the "non-utf8 strings" happen to be all ascii characters (ord()<128), then it won't matter, because they are just a proper subset of utf8, and concatenating these with utf8 strings causes no problem.

    But if a "non-utf8" string happens to also be "non-ascii", then what would you expect to happen when you concatenate this with a utf8 string? What would you expect to do with the result of such a concatenation? (Hint: unless the answer is something strange and ad-hoc involving pack and unpack, then the real answer is: something incoherent.)

    You can't just throw utf8 characters and non-utf8/non-ascii data into a single scalar value and expect to get anything usable. If you combine data this way, the bug you expose is not in perl, but rather in your expectations.

    Either keep these data types separate at all times, or else, if the latter type is actually character data in some other encoding, then decode() it into utf8 characters (refer to the Encode module) -- or alternatively, encode() the utf8 string into the same character set as the other data, before concatenating.
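A small sketch of the decode-before-concatenating advice, under the assumption (made up for illustration) that one field arrived as ISO-8859-15 bytes. In that encoding byte 0xA4 is the euro sign, whereas Latin-1 reads it as the generic currency sign, so the implicit upgrade picks the wrong character.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed for illustration: legacy bytes in ISO-8859-15 and some decoded text.
my $legacy_bytes = "\xa4 100";
my $text         = "Price \x{2122}: ";

# Naive concatenation: Perl treats byte 0xA4 as Latin-1, giving U+00A4.
my $naive = $text . $legacy_bytes;

# Decoding first maps byte 0xA4 to the intended character, U+20AC.
my $combined = $text . decode('iso-8859-15', $legacy_bytes);

my $naive_cp   = ord substr($naive,    length($text), 1);
my $decoded_cp = ord substr($combined, length($text), 1);
printf "naive: U+%04X, decoded: U+%04X\n", $naive_cp, $decoded_cp;
# prints: naive: U+00A4, decoded: U+20AC
```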

    UPDATE: Actually, as pointed out by almut, perl's default behavior (interpret non-ascii/non-utf8 bytes as Latin-1 characters) makes it possible that one of the "more likely" situations -- converting some old single-byte Latin-1 text data to utf8 -- can be handled automatically, and produces a coherent result. It's only when the non-utf8 data is neither ascii nor Latin-1 that the trouble starts.

Re: Problem with join'ing utf8 and non-utf8 strings (bug?)
by jbert (Priest) on Jun 18, 2008 at 14:26 UTC
    In case it's not obvious from what other people have said above:
    • Perl is autoconverting your non-tagged string to utf8 for you. In doing so, it assumes it is already in an encoding (iso-latin-1). This assumption is what is at odds with your expectations (you're thinking of this data as a series of utf8 chars, rather than a series of latin-1 chars).
    • Everything should work out OK as long as you ensure the inputs+outputs to your program tag data appropriately. That is, look into 'binmode' to set the :utf8 flag on a filehandle, and/or the 'open' module listed above, and perhaps -Cio cmdline option.
    • Other sources of data can be a pain. e.g. stuff pulled from a db. There are ways around this (see mysql_enable_utf8 in DBD::mysql, and associated charset settings on the db server side).
    • The thing to remember is that you don't want a mix of utf8 tagged and non-tagged data loose in your code. The best way to achieve this is to ensure that all data is tagged at the entry points.
    • Some CPAN modules just don't seem to play nicely with correctly-tagged utf8 data. (e.g. Template::Toolkit requires that you stick a byte-order-mark in your templates (ugh) rather than allowing you to tell it an encoding).
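A rough sketch of tagging data at the entry and exit points with I/O layers, as suggested in the list above. It uses in-memory filehandles so that it is self-contained; with real files the same layer strings would go in the open mode. The variable names are illustrative.

```perl
use strict;
use warnings;

my $bytes = "\xc2\xa9\xc2\xae\xe2\x84\xa2\n";    # UTF-8 encoded input

# Reading through an :encoding(UTF-8) layer decodes on the way in.
open my $in, '<:encoding(UTF-8)', \$bytes or die "open: $!";
my $line = <$in>;
close $in;
chomp $line;

# Writing through the same layer encodes on the way out.
my $out_buffer = '';
open my $out, '>:encoding(UTF-8)', \$out_buffer or die "open: $!";
print {$out} $line, $line;
close $out;

# binmode does the same job for a handle that is already open.
binmode(STDOUT, ':encoding(UTF-8)');
print "read ", length($line), " characters, wrote ",
      length($out_buffer), " bytes: $line\n";
```

With the layers in place, the code in between only ever sees decoded characters.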

Node Type: perlquestion [id://692551]
Approved by toolic
Front-paged by Old_Gray_Bear