utf8 encoding bug?

by zemplen (Novice)
on Feb 25, 2003 at 14:47 UTC (#238442)
zemplen has asked for the wisdom of the Perl Monks concerning the following question:

I am working with documents that use ISO 8859-1, -2 and -5 encodings, plus some nonstandard encodings and fonts, all of which need to be mapped to UTF-8. To clean things up I simply want to do substitutions within the input strings. As you can see in the test case below, this does not work 100% of the time.

    #!perl -w
    require 5.8.0;
    use strict;
    use utf8;
    use Encode;

    ${^WIDE_SYSTEM_CALLS} = 1;
    #no warnings 'utf8';

    # Send output to files named STDOUT and STDERR, written as UTF-8.
    open STDOUT, "> STDOUT"; binmode(STDOUT, ":utf8");
    open STDERR, "> STDERR"; binmode(STDERR, ":utf8");

    # Only element [1] (the \x{...} escape) is used by the test below.
    my @a_lc_grave     = ('à', "\x{00E0}", 'a');
    my @a_lc_diaeresis = ('ä', "\x{00E4}", 'a');
    my @acy            = ('а', "\x{0430}", '');
    my @dcy            = ('д', "\x{0434}", '');

    my $text = '';

    $text = ${a_lc_grave[1]} . ${a_lc_diaeresis[1]};
    Encode::encode_utf8($text);
    &test("success");

    $text = ${a_lc_diaeresis[1]} . ${a_lc_grave[1]};
    Encode::encode_utf8($text);
    &test("fail");

    $text = ${a_lc_grave[1]} . ${a_lc_grave[1]};
    Encode::encode_utf8($text);
    &test("success");

    $text = ${a_lc_diaeresis[1]} . ${a_lc_diaeresis[1]};
    Encode::encode_utf8($text);
    &test("success");

    sub test () {
        print "-"x20, "\n";
        print $_[0], "\n";
        print "Before = ", unpack ("U*", ${text}), "\n\n";
        $text=~s/${a_lc_diaeresis[1]}/${dcy[1]}/g;
        $text=~s/${a_lc_grave[1]}/${acy[1]}/g;
        print "After = ", unpack ("U*", ${text}), "\n\n";
        print $text, "\n\n";
    }

Replies are listed 'Best First'.
Re: utf8 encoding bug?
by hv (Parson) on Feb 25, 2003 at 16:21 UTC

    This is a very strange bug. It appears to be happening because the replacement is coming from an array; witness the following code:

        #!/usr/bin/perl -w
        require 5.8.0;
        use strict;

        my($a1, $d1) = ("\x{00E0}", "\x{00E4}");      # patterns: a-grave, a-diaeresis
        my($a2, $d2) = ("\x{0430}", "\x{0434}");      # replacements in plain scalars
        my($a3, $d3) = (["\x{0430}"], ["\x{0434}"]);  # replacements in array refs
        my @a4 = "\x{0430}";                          # replacements in arrays
        my @d4 = "\x{0434}";

        for (\&t2, \&t3, \&t4, \&t5) {
            my $text = $d1.$a1;
            warn "Before = ", join('.', unpack ("U*", ${text})), "\n\n";
            &$_($text);
            warn "After = ", join('.', unpack ("U*", ${text})), "\n\n";
        }

        # Replacement taken from plain scalars.
        sub t2 {
            $_[0] =~ s/$d1/$d2/g;
            $_[0] =~ s/$a1/$a2/g;
        }

        # Replacement taken from array references.
        sub t3 {
            $_[0] =~ s/$d1/$d3->[0]/g;
            $_[0] =~ s/$a1/$a3->[0]/g;
        }

        # Replacement taken directly from array elements.
        sub t4 {
            $_[0] =~ s/$d1/$d4[0]/g;
            $_[0] =~ s/$a1/$a4[0]/g;
        }

        # Array elements copied into scalars before substituting.
        sub t5 {
            my $a5 = $a4[0];
            my $d5 = $d4[0];
            $_[0] =~ s/$d1/$d5/g;
            $_[0] =~ s/$a1/$a5/g;
        }

    The t3() and t4() calls fail for me under perl-5.8.0 and with recent development sources at patchlevel 18736. The very latest development sources (@18777) succeed for all four cases, so this has clearly been fixed by a very recent patch.

    The success of t5() in the above code suggests a workaround: copy the replacement value into a scalar variable, and use that scalar in the substitution.
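
To apply the t5() idea to array-based tables like the ones in the original question, something along these lines might do. It is only a sketch: the helper name to_cyrillic and the variable names inside it are made up for illustration, and it has not been run against perl-5.8.0 itself, so treat it as a starting point and not a verified fix.

```perl
use strict;
use warnings;

# Replacement characters held in arrays, as in the original test case.
my @acy = ('', "\x{0430}");   # CYRILLIC SMALL LETTER A
my @dcy = ('', "\x{0434}");   # CYRILLIC SMALL LETTER DE

# Hypothetical helper: map a-diaeresis and a-grave to Cyrillic letters.
sub to_cyrillic {
    my ($text) = @_;

    # Copy the array elements into plain scalars first, then interpolate
    # only those scalars in the replacement part of s///.
    my $dcy_char = $dcy[1];
    my $acy_char = $acy[1];

    $text =~ s/\x{00E4}/$dcy_char/g;
    $text =~ s/\x{00E0}/$acy_char/g;
    return $text;
}

my $result = to_cyrillic("\x{00E4}\x{00E0}");

# Show the resulting code points instead of the raw characters.
print join('.', map { sprintf 'U+%04X', ord } split //, $result), "\n";
```

Working on a copy and returning it keeps the caller's string untouched, which makes it easier to compare before and after when debugging.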

Re: utf8 encoding bug?
by zemplen (Novice) on Feb 25, 2003 at 17:50 UTC
    This fixed one case and broke another:

        sub test2 () {
            print "-"x20, "\n";
            print $_[0], "\n";
            print "Before = ", unpack ("U*", ${text}), "\n\n";
            my $tmp = ${dcy[1]};
            $text=~s/${a_lc_diaeresis[1]}/$tmp/g;
            $tmp = ${acy[1]};
            $text=~s/${a_lc_grave[1]}/$tmp/g;
            print "After = ", unpack ("U*", ${text}), "\n\n";
            print $text, "\n\n";
        }

      Ah sorry, I didn't think to check the original test again.

      At least some of the problems are occurring because of bugs in perl-5.8.0 when upgrading a non-utf8 string to utf8 at odd times, so the easiest workaround I could find was to force every string to be upgraded before messing with it. I'm not sure what your Encode::encode_utf8() calls are supposed to be doing - they don't appear to be having any effect - but if I replace each of them with a force_utf8($text) using the definition below all tests appear to do the right thing:

      sub force_utf8 { chop( $_[0] .= "\x{100}" ); }
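
The append-and-chop trick upgrades the string as a side effect of briefly holding a character above 0xFF. A more explicit way to ask for the same upgrade is utf8::upgrade. The sketch below assumes a current perl; the name force_utf8_upgrade is made up here, utf8::is_utf8 is (as far as I know) only available from 5.8.1 onwards, and whether this variant also sidesteps the 5.8.0 bug has not been tested.

```perl
use strict;
use warnings;

# Hypothetical variant of force_utf8() built on utf8::upgrade.
sub force_utf8_upgrade {
    utf8::upgrade($_[0]);   # changes the caller's variable in place
    return;
}

my $text   = "\x{00E4}\x{00E0}";               # both characters fit in Latin-1
my $before = utf8::is_utf8($text) ? 1 : 0;     # internal UTF-8 flag before
force_utf8_upgrade($text);
my $after  = utf8::is_utf8($text) ? 1 : 0;     # internal UTF-8 flag after

# The characters and the length are unchanged; only the internal
# representation differs.
print "flag before: $before, after: $after, length: ", length($text), "\n";
```

The upgrade changes only how perl stores the string, so comparisons and length() give the same answers before and after.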

      Since your original test already works correctly under the latest development sources, I am confident that the next maintenance release (i.e. 5.8.1) will also include the fix.
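
On the Encode::encode_utf8() calls in the original test: the function returns the encoded octets and leaves its argument alone, so calling it without using the return value would be expected to have no visible effect. A small sketch (variable names are illustrative only):

```perl
use strict;
use warnings;
use Encode ();

my $text = "\x{00E4}\x{00E0}";          # two characters

# Called without using the result: the encoded copy is thrown away
# and $text stays as it was.
Encode::encode_utf8($text);
my $char_length = length $text;

# Capturing the return value gives the UTF-8 encoded octets.
my $octets       = Encode::encode_utf8($text);
my $octet_length = length $octets;

print "characters: $char_length, octets: $octet_length\n";
```

For the substitutions in this thread the character string is what you want to work on; encoding to octets is normally left to the output layer, such as the binmode(STDOUT, ":utf8") already present in the test.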

