PerlMonks  

Substring giving strange results on $1 with utf8

by choroba (Cardinal)
on Aug 06, 2012 at 22:01 UTC ( [id://985845]=perlquestion )

choroba has asked for the wisdom of the Perl Monks concerning the following question:

Most honourable brothers and sisters in Perl. At work, we are having issues with substr. It returns strange values when called on the special variable $1.
The following code creates the test script and runs it:
use strict;
use warnings;

open my $PL, '>', 'utf2.pl' or die $!;
print {$PL} << '__PL__';
##########################################################
use strict;
use warnings;

binmode STDOUT, ':utf8';
binmode STDIN, ':utf8';

while (my $line = <>) {
    if (my ($word) = $line =~ /^(.+)$/) {
        my $one = substr($1, 0, 1);          # doesn't work
        my $w_one = substr($word, 0, 1);     # works
        print "'$one' = '$w_one'\tat $line" unless $one eq $w_one;
    }
}
##########################################################
__PL__

open my $OUT1, '>', 'utf1' or die $!;
print {$OUT1} map chr hex, qw/61 61 c5 99 0a c4 8d 0a 61 61 c5 99 0a/;
close $OUT1;

open my $OUT2, '>', 'utf2' or die $!;
print {$OUT2} map chr hex, qw/c4 8d 0a 61 61 c5 99 0a c4 8d 0a/;
close $OUT2;

system "$^X utf2.pl < utf1";
print "\n";
system "$^X utf2.pl < utf2";
The output is (tested in blead 5.17.3, on x86_64-linux-thread-multi):
'�' = 'č'       at č

'aař' = 'a'     at aař
'č�' = 'č'      at č
Do you have any explanation? Should I submit a bug report?

Update: Thanks all, bug report sent.
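
Since the report above shows substr($word, 0, 1) giving the expected character while substr($1, 0, 1) does not, one possible workaround on an affected perl is to copy the capture into a lexical and take the substring of the copy. A minimal sketch, assuming the input has already been decoded to characters; the helper name first_char is hypothetical and not part of the original script:

```perl
use strict;
use warnings;

# Hypothetical helper: return the first character of a line without
# ever calling substr directly on $1.
sub first_char {
    my ($line) = @_;

    # A list-context match copies the capture into the lexical $word.
    my ($word) = $line =~ /^(.+)$/
        or return;

    return substr($word, 0, 1);
}

# Same characters as in the test files above (U+010D and "aa" . U+0159).
for my $line ("\x{10D}\n", "aa\x{159}\n") {
    printf "U+%04X\n", ord(first_char($line));
}
```

This prints U+010D and then U+0061. It only sidesteps the symptom; it says nothing about why substr on $1 misbehaves.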

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

Replies are listed 'Best First'.
Re: Substring giving strange results on $1 with utf8
by Anonymous Monk on Aug 07, 2012 at 03:32 UTC

    The following code creates the test script and runs it:

    I don't like that so I rewrote it

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use Data::Dump;

    my $utf1 = pack 'H*', join'', qw/61 61 c5 99 0a c4 8d 0a 61 61 c5 99 0a/;
    my $utf2 = pack 'H*', join'', qw/c4 8d 0a 61 61 c5 99 0a c4 8d 0a/;
    dd $utf1, $utf2;
    utf8::decode( $utf1 );
    utf8::decode( $utf2 );
    dd $utf1, $utf2;

    for my $string ( $utf1, $utf2 ){
        my @lines = split /\n/, $string;
        for my $line ( @lines ){
            if( my($word) = $line =~ /^(.+)$/ ){
                my $one = substr $1, 0, 1;
                my $wone = substr $word, 0, 1;
                dd { word => $word, 1 => $1, one => $one , wone => $wone };
            }
        }
        dd \@lines;
        dd;
    }
    __END__
    (
      "aa\xC5\x99\n\xC4\x8D\naa\xC5\x99\n",
      "\xC4\x8D\naa\xC5\x99\n\xC4\x8D\n",
    )
    (
      "aa\x{159}\n\x{10D}\naa\x{159}\n",
      "\x{10D}\naa\x{159}\n\x{10D}\n",
    )
    { 1 => "aa\x{159}", one => "a", wone => "a", word => "aa\x{159}" }
    { 1 => "\x{10D}", one => "Ä", wone => "\x{10D}", word => "\x{10D}" }
    { 1 => "aa\x{159}", one => "a", wone => "a", word => "aa\x{159}" }
    ["aa\x{159}", "\x{10D}", "aa\x{159}"]
    ()
    { 1 => "\x{10D}", one => "Ä", wone => "\x{10D}", word => "\x{10D}" }
    { 1 => "aa\x{159}", one => "a", wone => "a", word => "aa\x{159}" }
    { 1 => "\x{10D}", one => "Ä", wone => "\x{10D}", word => "\x{10D}" }
    ["\x{10D}", "aa\x{159}", "\x{10D}"]
    ()

    I've got perl 5.014001, and I get these two warnings from Data::Dump: "Malformed UTF-8 character (unexpected end of string)" and "Wide character in print". It appears the UTF-8 flag gets turned off when substr-ing on $1 in some cases, and it appears this bug has surfaced before: https://rt.perl.org/rt3/Public/Search/Simple.html?q=%241%20utf

    But I'm no expert
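
The suspicion above, that the UTF-8 flag is lost when substr is applied to $1, can be checked on a given perl by comparing the substring taken from $1 with the one taken from a copy. A diagnostic sketch, assuming the same sample characters as in the posts above; the describe helper is hypothetical, and utf8::is_utf8 only reports the internal flag, so this shows the symptom rather than the cause:

```perl
use strict;
use warnings;

# Hypothetical helper: summarize the internal state of a string as
# "utf8=<flag> len=<characters> chars=<code points>".
sub describe {
    my ($str) = @_;
    return sprintf(
        'utf8=%d len=%d chars=%s',
        (utf8::is_utf8($str) ? 1 : 0),
        length($str),
        join(' ', map { sprintf('U+%04X', ord $_) } split(//, $str)),
    );
}

# Already-decoded lines, in the same order as the first test string.
my @lines = ("aa\x{159}", "\x{10D}", "aa\x{159}");

for my $line (@lines) {
    if (my ($word) = $line =~ /^(.+)$/) {
        my $from_capture = substr($1,    0, 1);   # the suspect call
        my $from_copy    = substr($word, 0, 1);   # the reference value
        print 'from $1:   ', describe($from_capture), "\n";
        print 'from copy: ', describe($from_copy),    "\n";
    }
}
```

On a perl where the bug is fixed, each pair of lines should be identical. On an affected perl, the "from $1" line for the second entry would be expected to differ from its "from copy" counterpart.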

Re: Substring giving strange results on $1 with utf8
by Khen1950fx (Canon) on Aug 07, 2012 at 02:45 UTC
    I tried it with a different twist, using perl-5.17.2:
    #!/usr/bin/perl
    BEGIN {
        $| = 1;
        $^W = 1;
    }
    use autodie;
    use strictures 1;
    use common::sense;

    open my $PL, '>', 'utf2.pl' or die $!;
    print {$PL} << '__PL__';
    binmode STDIN, ":utf8";
    binmode STDOUT, ":encoding(UTF-8)";
    while (defined(my $line = <ARGV>)) {
        if (my($word) = $line =~ /^(.+)$/) {
            my $one = substr($word, 0, 1);
            my $w_one = substr($word, 0, 1);
            print "'$one' = 'one'\t at $line" if $one eq $1;
            print "'$w_one' = 'w_one'\t at $line" unless $w_one eq $1;
        }
    }
    __PL__

    open my $OUT1, '>', 'utf1' or die $!;
    print {$OUT1} map chr hex, qw/61 61 c5 99 0a c4 8d 0a 61 61 c5 99 0a/;
    close $OUT1;

    open my $OUT2, '>', 'utf2' or die $!;
    print {$OUT2} map chr hex, qw/c4 8d 0a 61 61 c5 99 0a c4 8d 0a/;
    close $OUT2;

    system "$^X utf2.pl < utf1";
    print "\n";
    system "$^X utf2.pl < utf2";
    Does that help?

      I tried it with a different twist, using perl-5.17.2: ... Does that help?

      How do you think that helps?

      The code no longer demonstrates the perl bug, how can that help?

      Typical Khen1950fx nonsense: poke the code with a stick and post the changes.

        Typical AnonymousBug nonsense:);
Re: Substring giving strange results on $1 with utf8
by choroba (Cardinal) on Aug 31, 2012 at 10:38 UTC
    The bug has been fixed, although some related issues remain, as you may see in the thread.
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
