http://www.perlmonks.org?node_id=692551

rsmah has asked for the wisdom of the Perl Monks concerning the following question:

I ran into a problem using XML::Simple to generate output XML. The input hash was a mix of utf8 and non-utf8 strings. At the last stage, XML::Simple::XMLout joins the components together and I get corrupted data.

I found this behavior very odd, so I put together a test case that shows join corrupting a non-utf8 string when it is joined with a utf8 string.

At first I thought it might be decoding the non-utf8 string using the locale (or LANG or whatever) to some other encoding, but running this on a LANG=en_US.UTF-8 system produced the same results.

Can anyone explain to me what is going on?

Sample code:

no warnings 'utf8';
use Encode qw(decode is_utf8);

$r = "\xc2\xa9\xc2\xae\xe2\x84\xa2";
print "Raw \$r : ", $r, " - ", (is_utf8($r)?"is":"is not"), " utf8\n";

$u = decode('utf8', "\xc2\xa9\xc2\xae\xe2\x84\xa2");
print "UTF8 \$u : ", $u, " - ", (is_utf8($u)?"is":"is not"), " utf8\n";

$x = join('', $r, $u);
print "Join(\$r, \$u): ", $x, " - ", (is_utf8($x)?"is":"is not"), " utf8\n";

$e = decode('utf8', $r);
print "Encd \$e : ", $e, " - ", (is_utf8($e)?"is":"is not"), " utf8\n";

$y = join('', $e, $u);
print "Join(\$e, \$u): ", $y, " - ", (is_utf8($y)?"is":"is not"), " utf8\n";
Sample output:

Raw $r : ©®™ - is not utf8
UTF8 $u : ©®™ - is utf8
Join($r, $u): ©®â�¢©®™ - is utf8
Encd $e : ©®™ - is utf8
Join($e, $u): ©®™©®™ - is utf8
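
A smaller sketch that prints code points instead of the strings themselves may make the effect easier to see without the terminal's encoding getting in the way. The helper name `codepoints` and the variables `$bytes`, `$chars`, `$mixed` and `$clean` are made up for this illustration. It assumes, as the output above suggests, that join treats each byte of the non-utf8 string as a single character (as if it were Latin-1) regardless of the locale.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes of U+00A9 (copyright sign), never decoded: a plain byte string.
my $bytes = "\xc2\xa9";
# U+2122 (trade mark sign) decoded into a character string.
my $chars = decode('UTF-8', "\xe2\x84\xa2");

# Hypothetical helper: render a string as a list of its code points.
sub codepoints {
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $_[0];
}

# Joining the byte string with the character string: each byte of $bytes
# ends up as its own character.
my $mixed = join('', $bytes, $chars);
print "mixed: ", codepoints($mixed), "\n";   # mixed: U+00C2 U+00A9 U+2122

# Decoding the byte string first gives the intended characters.
my $clean = join('', decode('UTF-8', $bytes), $chars);
print "clean: ", codepoints($clean), "\n";   # clean: U+00A9 U+2122
```

If this reading is right, the two bytes \xc2\xa9 become the two characters U+00C2 and U+00A9 in the joined string, which would match the corruption in the Join($r, $u) line above.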