
Re: RegExp breaks in Perl 5.10

by grinder (Bishop)
on Mar 06, 2008 at 20:24 UTC ( [id://672572]=note )


in reply to RegExp breaks in Perl 5.10

Hmm, I was sufficiently surprised by this behaviour (that I've not heard of before) that I went looking. First off, your code fragment is not much use, as it does not define what $R2 contains. So I went and looked at the source, and ripped the following out of its guts:

    use strict;
    use warnings;

    my @word = qw(
        constituci\xf3n contribuci\xf3n destituci\xf3n devoluci\xf3n disminuci\xf3n
        constituciones contribuciones destituciones devoluciones disminuciones
        foo
    );

    my $vowels       = 'aeiou\xe1\xe9\xed\xf3\xfa\xfc';
    my $consonants   = 'bcdfghjklmn\xf1pqrstvwxyz';
    my $revowel      = qr/[$vowels]/;
    my $reconsonants = qr/[$consonants]/;

    my $R2;
    my $suffix;

    for my $word (@word) {
        ($R2) = $word =~ /^.*?$revowel$reconsonants.*?$revowel$reconsonants(.*)$/;
        $R2 ||= '';
        if ( ($suffix) = $R2 =~ /(uciones|uci\xf3n)$/ ) {
            # uci\xf3n uciones
            # replace with u if in R2
            $word =~ s/$suffix$/u/;
            print "Step 1 case 4: $word\n";
        }
    }

(Those \xnn sequences really are Latin-1 characters in my copy; the escapes are just an artifact of a direct cut'n'paste from my shell.)

And that runs just fine here, all the way up to "perl, v5.11.0 DEVEL33323 built for i386-freebsd-64int". So there's something else going on. Both "ución" and "uciones" match just fine. Perhaps the tester platforms are running in a different locale. To play it safe, I suggest you encode your program in UTF-8 and slap a use utf8 at the top and be done with it. At least I think that's the correct best practice. Thinking about encoding makes my head explode.
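A minimal sketch of what the `use utf8` suggestion could look like, assuming the file is saved as UTF-8. The helper name `strip_ucion` is hypothetical and not taken from the module; it only mirrors the R2 test from the fragment above, with the accented letters written directly in the source.

```perl
use strict;
use warnings;
use utf8;    # tells perl the literals in this file are UTF-8 encoded characters

# Hypothetical helper (not from the module): replace a trailing
# "ución"/"uciones" with "u" when it lies inside R2, as in the fragment above.
sub strip_ucion {
    my ($word) = @_;

    my $vowel     = qr/[aeiouáéíóúü]/;
    my $consonant = qr/[bcdfghjklmnñpqrstvwxyz]/;
    my $suffix_re = qr/(?:uciones|ución)$/;

    # R2: everything after the second vowel+consonant pair
    my ($r2) = $word =~ /^.*?$vowel$consonant.*?$vowel$consonant(.*)$/;
    $r2 = '' unless defined $r2;

    return $word unless $r2 =~ $suffix_re;

    ( my $stem = $word ) =~ s/$suffix_re/u/;
    return $stem;
}

# Encode output explicitly so wide characters are written as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print strip_ucion($_), "\n" for qw(constitución devoluciones foo);
```

On a current perl this should print constitu, devolu and foo. Whether it also sidesteps whatever the 5.10 testers are seeing is not something this sketch can show.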

• another intruder with the mooring in the heart of the Perl

Re^2: RegExp breaks in Perl 5.10
by almut (Canon) on Mar 06, 2008 at 21:13 UTC

    I think the issue with the module's original code is that one side of the match has been decoded from UTF-8 (the word list from the file) while the other is in Latin-1 (the literal strings in the source). In your test case, both are in Latin-1, so they match.
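To make the two representations concrete, here is a small sketch. It does not reproduce the 5.10 failure; it only shows the difference between raw UTF-8 octets, a decoded string, and a Latin-1 literal. The variable names and the `ends_in_ucion` helper are hypothetical, and the byte string stands in for a line read from a UTF-8 word list.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for a word read from a UTF-8 encoded file (raw octets).
my $octets_from_file = "devoluci\xc3\xb3n";

# The same word as it would appear as a literal in a Latin-1 source file.
my $latin1_literal = "devoluci\xf3n";

# Hypothetical helper: true if the word ends in "ución" (o-acute is U+00F3).
sub ends_in_ucion {
    my ($word) = @_;
    return $word =~ /uci\x{f3}n\z/ ? 1 : 0;
}

# Decoding turns the two octets C3 B3 into the single character U+00F3.
my $decoded = decode( 'UTF-8', $octets_from_file );

printf "raw octets match:   %d\n", ends_in_ucion($octets_from_file);
printf "decoded matches:    %d\n", ends_in_ucion($decoded);
printf "latin1 matches:     %d\n", ends_in_ucion($latin1_literal);
printf "utf8 flag (decoded): %s\n", utf8::is_utf8($decoded) ? 'on' : 'off';
printf "utf8 flag (literal): %s\n", utf8::is_utf8($latin1_literal) ? 'on' : 'off';
```

The undecoded octets never match, because the pattern looks for one character where the string holds two bytes. The decoded and Latin-1 forms hold the same characters and differ only in their internal flag, which is the state the test further down forces.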

    When adding (at the beginning of the loop)

    $word = Encode::decode("iso-8859-1", $word); # force utf8 flag on
    print "$word:\n";

    I can reproduce the problem, i.e. when forcing utf8, I get

    constitución:
    contribución:
    destitución:
    devolución:
    disminución:
    constituciones:
    Step 1 case 4: constitu
    contribuciones:
    Step 1 case 4: contribu
    destituciones:
    Step 1 case 4: destitu
    devoluciones:
    Step 1 case 4: devolu
    disminuciones:
    Step 1 case 4: disminu
    foo:
    

    while with your original test, the output is

    constitución:
    Step 1 case 4: constitu
    contribución:
    Step 1 case 4: contribu
    destitución:
    Step 1 case 4: destitu
    devolución:
    Step 1 case 4: devolu
    disminución:
    Step 1 case 4: disminu
    constituciones:
    Step 1 case 4: constitu
    contribuciones:
    Step 1 case 4: contribu
    destituciones:
    Step 1 case 4: destitu
    devoluciones:
    Step 1 case 4: devolu
    disminuciones:
    Step 1 case 4: disminu
    foo:
    
Re^2: RegExp breaks in Perl 5.10
by eserte (Deacon) on Mar 06, 2008 at 20:52 UTC
    If there's no "use locale" in the script, then it should not be locale-dependent.
