
utf8 encoding bug?

by zemplen (Novice)
on Feb 25, 2003 at 14:47 UTC (#238442)
zemplen has asked for the wisdom of the Perl Monks concerning the following question:

I am working with documents that use ISO 8859-1, -2 and -5 encodings, plus some nonstandard encodings and fonts, all of which need to be mapped to UTF-8. To clean things up I simply want to do substitutions within the input strings. As the test case below shows, this does not work 100% of the time.
#!perl -w
require 5.8.0;
use strict;
use utf8;
use Encode;

${^WIDE_SYSTEM_CALLS} = 1;
#no warnings 'utf8';

open STDOUT, "> STDOUT";
binmode(STDOUT, ":utf8");
open STDERR, "> STDERR";
binmode(STDERR, ":utf8");

my @a_lc_grave     = ('à', "\x{00E0}", 'a');
my @a_lc_diaeresis = ('ä', "\x{00E4}", 'a');
my @acy            = ('а', "\x{0430}", '');
my @dcy            = ('д', "\x{0434}", '');

my $text = '';

$text = ${a_lc_grave[1]} . ${a_lc_diaeresis[1]};
Encode::encode_utf8($text);
&test("success");

$text = ${a_lc_diaeresis[1]} . ${a_lc_grave[1]};
Encode::encode_utf8($text);
&test("fail");

$text = ${a_lc_grave[1]} . ${a_lc_grave[1]};
Encode::encode_utf8($text);
&test("success");

$text = ${a_lc_diaeresis[1]} . ${a_lc_diaeresis[1]};
Encode::encode_utf8($text);
&test("success");

sub test () {
    print "-"x20, "\n";
    print $_[0], "\n";
    print "Before = ", unpack ("U*", ${text}), "\n\n";
    $text=~s/${a_lc_diaeresis[1]}/${dcy[1]}/g;
    $text=~s/${a_lc_grave[1]}/${acy[1]}/g;
    print "After = ", unpack ("U*", ${text}), "\n\n";
    print $text, "\n\n";
}

Re: utf8 encoding bug?
by hv (Parson) on Feb 25, 2003 at 16:21 UTC

    This is a very strange bug. It appears to be happening because the replacement is coming from an array; witness the following code:

    #!/usr/bin/perl -w
    require 5.8.0;
    use strict;

    my($a1, $d1) = ("\x{00E0}", "\x{00E4}");
    my($a2, $d2) = ("\x{0430}", "\x{0434}");
    my($a3, $d3) = (["\x{0430}"], ["\x{0434}"]);
    my @a4 = "\x{0430}";
    my @d4 = "\x{0434}";

    for (\&t2, \&t3, \&t4, \&t5) {
        my $text = $d1.$a1;
        warn "Before = ", join('.', unpack ("U*", ${text})), "\n\n";
        &$_($text);
        warn "After = ", join('.', unpack ("U*", ${text})), "\n\n";
    }

    sub t2 {
        $_[0] =~ s/$d1/$d2/g;
        $_[0] =~ s/$a1/$a2/g;
    }

    sub t3 {
        $_[0] =~ s/$d1/$d3->[0]/g;
        $_[0] =~ s/$a1/$a3->[0]/g;
    }

    sub t4 {
        $_[0] =~ s/$d1/$d4[0]/g;
        $_[0] =~ s/$a1/$a4[0]/g;
    }

    sub t5 {
        my $a5 = $a4[0];
        my $d5 = $d4[0];
        $_[0] =~ s/$d1/$d5/g;
        $_[0] =~ s/$a1/$a5/g;
    }

    The t3() and t4() calls fail for me under perl-5.8.0 and with recent development sources at patchlevel 18736. The very latest development sources (@18777) succeed for all four cases, so this has clearly been fixed by a very recent patch.

    The success of t5() in the above code suggests a workaround - grab the replacement variable into a scalar variable, and use that scalar for the replacement.
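
The workaround described above can be sketched in isolation as follows. This is only an illustration: the helper name `replace_via_scalars` and the variables `$a_repl` and `$d_repl` are hypothetical and do not appear in the thread. On a current Perl the direct array form works as well, so the copy only matters on an affected 5.8.0 build.

```perl
use strict;
use warnings;

# Source characters (Latin-1 range) and their replacements held in arrays,
# mirroring $a1/$d1 and @a4/@d4 from the test script above.
my ($a1, $d1) = ("\x{00E0}", "\x{00E4}");
my @a4 = ("\x{0430}");   # CYRILLIC SMALL LETTER A
my @d4 = ("\x{0434}");   # CYRILLIC SMALL LETTER DE

# Hypothetical helper: copy each array element into a plain scalar first,
# then interpolate only those scalars on the replacement side of s///.
sub replace_via_scalars {
    my ($text) = @_;
    my ($a_repl, $d_repl) = ($a4[0], $d4[0]);
    $text =~ s/$d1/$d_repl/g;
    $text =~ s/$a1/$a_repl/g;
    return $text;
}

my $result = replace_via_scalars($d1 . $a1);

# Print code points as ASCII so no output layer is needed.
print join(' ', map { sprintf 'U+%04X', ord } split //, $result), "\n";
```

Working on a copy and returning it, rather than modifying `$_[0]` in place as t5() does, is a choice made for this sketch only.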

Re: utf8 encoding bug?
by zemplen (Novice) on Feb 25, 2003 at 17:50 UTC
    This fixed one test and broke another:
    sub test2 () {
        print "-"x20, "\n";
        print $_[0], "\n";
        print "Before = ", unpack ("U*", ${text}), "\n\n";
        my $tmp = ${dcy[1]};
        $text=~s/${a_lc_diaeresis[1]}/$tmp/g;
        $tmp = ${acy[1]};
        $text=~s/${a_lc_grave[1]}/$tmp/g;
        print "After = ", unpack ("U*", ${text}), "\n\n";
        print $text, "\n\n";
    }

      Ah sorry, I didn't think to check the original test again.

      At least some of the problems are occurring because of bugs in perl-5.8.0 when upgrading a non-utf8 string to utf8 at odd times, so the easiest workaround I could find was to force every string to be upgraded before messing with it. I'm not sure what your Encode::encode_utf8() calls are supposed to be doing - they don't appear to be having any effect - but if I replace each of them with a force_utf8($text) using the definition below all tests appear to do the right thing:

      sub force_utf8 { chop( $_[0] .= "\x{100}" ); }
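
To see what the two calls actually do to a string, a small check along these lines may help. It is a sketch under assumptions: it uses `utf8::is_utf8` to inspect the internal flag, which is available on current Perls but may not exist on 5.8.0 itself, and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode ();

# Append a character above U+00FF, which makes Perl upgrade the string
# to its internal UTF-8 form, then chop that character off again.
sub force_utf8 { chop( $_[0] .= "\x{100}" ); }

my $original = "\x{00E4}\x{00E0}";   # Latin-1 range, not yet upgraded
my $text     = $original;
my $before   = utf8::is_utf8($text) ? 1 : 0;

# encode_utf8() returns the encoded octets; in void context the return
# value is discarded and $text itself is left as it was.
Encode::encode_utf8($text);
my $unchanged = ($text eq $original) ? 1 : 0;

force_utf8($text);
my $after = utf8::is_utf8($text) ? 1 : 0;
my $same  = ($text eq $original) ? 1 : 0;

print "flag before=$before after=$after; ",
      "unchanged by encode_utf8=$unchanged; same characters=$same\n";
```

The characters compare equal before and after; only the internal representation differs, which is what the workaround relies on.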

      Since your original test already works correctly under the latest development sources, I am confident that the next maintenance release (i.e. 5.8.1) will also include the fix.

