Substring giving strange results on $1 with utf8

by choroba (Cardinal)
choroba has asked for the wisdom of the Perl Monks concerning the following question:

Most honourable brothers and sisters in Perl. At work, we are having issues with substr. It returns strange values when called on the special variable $1.
The following code creates the test script and runs it:
use strict;
use warnings;

open my $PL, '>', '' or die $!;
print {$PL} << '__PL__';
##########################################################
use strict;
use warnings;

binmode STDOUT, ':utf8';
binmode STDIN, ':utf8';

while (my $line = <>) {
    if (my ($word) = $line =~ /^(.+)$/) {
        my $one = substr($1, 0, 1);         # doesn't work
        my $w_one = substr($word, 0, 1);    # works
        print "'$one' = '$w_one'\tat $line" unless $one eq $w_one;
    }
}
##########################################################
__PL__

open my $OUT1, '>', 'utf1' or die $!;
print {$OUT1} map chr hex, qw/61 61 c5 99 0a c4 8d 0a 61 61 c5 99 0a/;
close $OUT1;

open my $OUT2, '>', 'utf2' or die $!;
print {$OUT2} map chr hex, qw/c4 8d 0a 61 61 c5 99 0a c4 8d 0a/;
close $OUT2;

system "$^X < utf1";
print "\n";
system "$^X < utf2";
The output is (tested in blead 5.17.3, on x86_64-linux-thread-multi):
'�' = 'č'       at č

'aař' = 'a'     at aař
'č�' = 'č'      at č
Do you have any explanation? Should I submit a bugreport?

Update: Thanks all, bugreport sent.

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

Re: Substring giving strange results on $1 with utf8
by Anonymous Monk on Aug 07, 2012 at 03:32 UTC

    I don't like that so I rewrote it

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use Data::Dump;

    my $utf1 = pack 'H*', join '', qw/61 61 c5 99 0a c4 8d 0a 61 61 c5 99 0a/;
    my $utf2 = pack 'H*', join '', qw/c4 8d 0a 61 61 c5 99 0a c4 8d 0a/;
    dd $utf1, $utf2;
    utf8::decode( $utf1 );
    utf8::decode( $utf2 );
    dd $utf1, $utf2;

    for my $string ( $utf1, $utf2 ){
        my @lines = split /\n/, $string;
        for my $line ( @lines ){
            if( my($word) = $line =~ /^(.+)$/ ){
                my $one = substr $1, 0, 1;
                my $wone = substr $word, 0, 1;
                dd { word => $word, 1 => $1, one => $one , wone => $wone };
            }
        }
        dd \@lines;
        dd;
    }
    __END__
    (
      "aa\xC5\x99\n\xC4\x8D\naa\xC5\x99\n",
      "\xC4\x8D\naa\xC5\x99\n\xC4\x8D\n",
    )
    (
      "aa\x{159}\n\x{10D}\naa\x{159}\n",
      "\x{10D}\naa\x{159}\n\x{10D}\n",
    )
    { 1 => "aa\x{159}", one => "a", wone => "a", word => "aa\x{159}" }
    { 1 => "\x{10D}", one => "Ä", wone => "\x{10D}", word => "\x{10D}" }
    { 1 => "aa\x{159}", one => "a", wone => "a", word => "aa\x{159}" }
    ["aa\x{159}", "\x{10D}", "aa\x{159}"]
    ()
    { 1 => "\x{10D}", one => "Ä", wone => "\x{10D}", word => "\x{10D}" }
    { 1 => "aa\x{159}", one => "a", wone => "a", word => "aa\x{159}" }
    { 1 => "\x{10D}", one => "Ä", wone => "\x{10D}", word => "\x{10D}" }
    ["\x{10D}", "aa\x{159}", "\x{10D}"]
    ()

    I've got perl 5.014001, and I get these warnings from Data::Dump: "Malformed UTF-8 character (unexpected end of string)" and "Wide character in print". It appears the UTF-8 flag gets turned off when substr-ing on $1 in some cases, and it appears this bug has surfaced before.

    But I'm no expert
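
Both test scripts in this thread show that substr on a copy of the capture ($word) returns the expected character even where substr on $1 does not. A minimal sketch of that workaround follows; the helper name first_char_of_capture is hypothetical and does not come from the thread, and this only sidesteps the problem rather than explaining it:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the first character of
# the first capture group, taken from a lexical copy instead of from $1.
sub first_char_of_capture {
    my ($line) = @_;
    # List assignment copies the capture into $word; bail out on no match.
    my ($word) = $line =~ /^(.+)$/ or return;
    return substr($word, 0, 1);
}

# UTF-8 bytes for U+010D (c with caron) plus a newline, as in the test data.
my $line = "\xC4\x8D\n";
utf8::decode($line);    # turn the bytes into a character string

my $first = first_char_of_capture($line);
printf "U+%04X length=%d\n", ord($first), length $first;
```

On an affected perl, calling substr($1, 0, 1) directly at the same point is the case reported as broken above; copying the capture first avoids touching $1 in substr at all.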

Re: Substring giving strange results on $1 with utf8
by Khen1950fx (Canon) on Aug 07, 2012 at 02:45 UTC
    I tried it with a different twist, using perl-5.17.2:
    #!/usr/bin/perl
    BEGIN {
        $| = 1;
        $^W = 1;
    }
    use autodie;
    use strictures 1;
    use common::sense;

    open my $PL, '>', '' or die $!;
    print {$PL} << '__PL__';
    binmode STDIN, ":utf8";
    binmode STDOUT, ":encoding(UTF-8)";

    while (defined(my $line = <ARGV>)) {
        if (my($word) = $line =~ /^(.+)$/) {
            my $one = substr($word, 0, 1);
            my $w_one = substr($word, 0, 1);
            print "'$one' = 'one'\t at $line" if $one eq $1;
            print "'$w_one' = 'w_one'\t at $line" unless $w_one eq $1;
        }
    }
    __PL__

    open my $OUT1, '>', 'utf1' or die $!;
    print {$OUT1} map chr hex, qw/61 61 c5 99 0a c4 8d 0a 61 61 c5 99 0a/;
    close $OUT1;

    open my $OUT2, '>', 'utf2' or die $!;
    print {$OUT2} map chr hex, qw/c4 8d 0a 61 61 c5 99 0a c4 8d 0a/;
    close $OUT2;

    system "$^X < utf1";
    print "\n";
    system "$^X < utf2";
    Does that help?

      How do you think that helps?

      The code no longer demonstrates the perl bug, how can that help?

      Typical Khen1950fx nonsense, poke the code with a stick and post the changes

        Typical AnonymousBug nonsense:);
Re: Substring giving strange results on $1 with utf8
by choroba (Cardinal) on Aug 31, 2012 at 10:38 UTC
    The bug has been fixed, although some related issues still remain, as you may see in the thread.
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

