I ran into a problem using XML::Simple to generate XML output. The input hash was a mix of utf8 and non-utf8 strings. At the last stage, XML::Simple::XMLout joins the components together and I get corrupted data.
I found this behavior very odd, so I put together a test case that shows join corrupting a non-utf8 string when it is joined with a utf8 string.
At first I thought join might be decoding the non-utf8 string according to the locale (LANG or similar), treating it as some other encoding, but running this on a LANG=en_US.UTF-8 system produced the same results.
Can anyone explain to me what is going on?
Sample code:
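use strict;
use warnings;
no warnings 'utf8';    # silence "Wide character in print"
use Encode qw(decode is_utf8);

# Raw UTF-8 encoded bytes; the UTF8 flag is off.
my $r = "\xc2\xa9\xc2\xae\xe2\x84\xa2";
print "Raw \$r : ", $r,
" - ", (is_utf8($r)?"is":"is not"), " utf8\n";

# The same bytes decoded into a character string; the UTF8 flag is on.
my $u = decode('utf8', "\xc2\xa9\xc2\xae\xe2\x84\xa2");
print "UTF8 \$u : ", $u,
" - ", (is_utf8($u)?"is":"is not"), " utf8\n";

# Joining the byte string with the character string gives corrupted output.
my $x = join('', $r, $u);
print "Join(\$r, \$u): ", $x,
" - ", (is_utf8($x)?"is":"is not"), " utf8\n";

# Decoding $r first and then joining gives the expected output.
my $e = decode('utf8', $r);
print "Encd \$e : ", $e,
" - ", (is_utf8($e)?"is":"is not"), " utf8\n";
my $y = join('', $e, $u);
print "Join(\$e, \$u): ", $y,
" - ", (is_utf8($y)?"is":"is not"), " utf8\n";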
Sample Output:
Raw $r : ©®™ - is not utf8
UTF8 $u : ©®™ - is utf8
Join($r, $u): ©®â�¢©®™ - is utf8
Encd $e : ©®™ - is utf8
Join($e, $u): ©®™©®™ - is utf8
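One way to narrow this down: the corrupted Join($r, $u) line looks like what you would get if Perl promoted each byte of the non-utf8 string to the character with the same code point (that is, read the bytes as Latin-1) before concatenating. The sketch below checks that guess and is not a confirmed diagnosis of what XML::Simple does internally; the names $joined and $expected are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

my $r = "\xc2\xa9\xc2\xae\xe2\x84\xa2";    # raw UTF-8 bytes, UTF8 flag off
my $u = decode('UTF-8', $r);               # character string, UTF8 flag on

# Same join as in the test case above.
my $joined = join('', $r, $u);

# Assumption being tested: the byte string is upgraded as if it were
# Latin-1, so each of its 7 bytes becomes one character.
my $expected = decode('iso-8859-1', $r) . $u;

print "join matches Latin-1 upgrade: ",
    ($joined eq $expected ? "yes" : "no"), "\n";
print "length of joined string: ", length($joined), "\n";
```

If the guess holds, the comparison prints "yes" and the length is 10 (7 characters from the upgraded bytes plus 3 from $u), which would explain why decoding $r before the join, as in the $e case, avoids the corruption.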