Odd problems with UTF-8, regexps, and newer Perl versions

by ablegrape (Initiate)
on Jun 04, 2010 at 20:42 UTC

ablegrape has asked for the wisdom of the Perl Monks concerning the following question:

Silly me, I upgraded my development Mac to Snow Leopard, and a bunch of my UTF-8 code broke with the newer version of Perl (5.10.0). I've isolated the problem to a strange behavior with regular expressions, have RT'ed the FM, and can't find an explanation in any of the expected behaviors of newer Perl. Thinking this might be a bug, I've rolled forward to 5.12.0/1, but the problem persists. I could try to roll back to the older (5.6?) Perl, but would prefer to understand what's going on, and fix my code, if possible.

Here's a simple test case. The string in question is valid UTF-8 as far as I can tell (same problem persists when reading from a UTF-8 file), and works with most regular expressions, just not a very specific combination of them.

    #!/usr/bin/perl

    use strict vars;
    use utf8;
    use encoding 'utf8';

    my $e = "Böck";
    if (utf8::is_utf8($e)) {
        print "yep, is UTF8\n";
    }
    # this fails with: Malformed UTF-8 character
    # seems to require the combination of a minimum-length wildcard match
    # + non-matching character class. For example:
    # m/.*?[k]$/ succeeds
    # m/.*?x$/ succeeds
    # m/.*[x]$/ succeeds
    if ($e=~ m/.*?[x]$/) {
        print "matched\n";
    }
    print "success with $e\n";

The program dies thus:

    % ./test.pl
    yep, is UTF8
    Malformed UTF-8 character (fatal) at ./test.pl line 17.

Have tried lots of things, to no avail. Perhaps some monk more adept than I will have a clue as to how to approach this?

Many thanks!

Replies are listed 'Best First'.
Re: Odd problems with UTF-8, regexps, and newer Perl versions
by almut (Canon) on Jun 04, 2010 at 21:17 UTC

    For me the problem goes away when I comment out the use encoding 'utf8' line (tested with 5.10.1).

    Why do you think you need it? use utf8 already tells Perl that the script source is in UTF-8 (and you can always use binmode to change layers for STDIN and STDOUT).

      I can replicate the problem in perl 5.10.0 too, but not in 5.8.8. almut's solution solves it.

      Thanks for the quick reply. I tried that, too, and while the regexp then works, the behavior changes.

      With only 'use utf8':
      % ./test.pl
      yep, is UTF8
      success with B?ck
      

      I see, "use encoding" also sets binmode on STDIN and STDOUT, so that while just using 'use' I need to explicitly add the binmode.

      With use utf8 plus "binmode STDOUT, ':utf8'":

      % ./test.pl
      yep, is UTF8
      success with Böck
      

      (My, Perl's unicode handling is complicated.) Now to see if I can apply this learning successfully to the original application, which is far more complex...

        I see, "use encoding" also sets binmode on STDIN and STDOUT, so that while just using 'use' I need to explicitly add the binmode.

        You can also use the open pragma for that, and also for future calls to open.
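
A minimal sketch of the open pragma approach, assuming the script file itself is saved as UTF-8. The `:std` argument applies the layer to STDIN, STDOUT and STDERR, so no separate binmode call is needed, and the same default is used by later open calls in that lexical scope.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                   # string literals in this file are UTF-8
use open qw(:std :utf8);    # UTF-8 layer on the standard handles and on later open() calls

my $e = "Böck";

# Prints the name correctly, without an explicit binmode on STDOUT.
print "success with $e\n";
```

This only replaces the I/O side of "use encoding"; it does not change how the source is parsed, which is still the job of "use utf8".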

Re: Odd problems with UTF-8, regexps, and newer Perl versions
by proceng (Scribe) on Jun 04, 2010 at 22:52 UTC
    The code works for me without either the "use utf8" or "use encoding 'utf8'" statements. It works in 5.8.9, 5.10.1 and 5.12.1 (all three are installed on this system independent of each other).

    A look at the doc page (perldoc utf8) shows the following:

    Do not use this pragma for anything else than telling Perl that your script is written in UTF-8. The utility functions described below are directly usable without "use utf8;".
    When UTF-8 becomes the standard source format, this pragma will effectively become a no-op.
    The following functions are defined in the "utf8::" package by the Perl core. You do not need to say "use utf8" to use these and in fact you should not say that unless you really want to have UTF-8 source code.
    So, try it without either "use" statement and see if the behaviour changes (for better or worse ;-)).

    Also, I noticed (belatedly) that the rollback would be to v5.6. This snippet from the docs may explain:

    While some limited functionality towards this does exist as of Perl 5.8.0, that is more accidental than designed; use of Unicode for the said purposes is unsupported.
      Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.

      OTOH, it's likely that the OP's code is written in UTF-8, i.e. the string "Böck" is represented in the source file as the bytes 42 c3 b6 63 6b, and not as 42 f6 63 6b (Latin-1).

      Otherwise (with Latin-1), he would be getting "Malformed UTF-8 character (unexpected non-continuation byte 0x63, immediately after start byte 0xf6) at ./843208.pl line 7." with the two use directives enabled (which is different from what's shown).

      Also, without either or both of the use directives enabled, the variable would not have the utf8 flag on (i.e. no "yep, is UTF8" message), irrespective of whether it's encoded as UTF-8 or Latin-1.  This would of course fundamentally change how it's handled internally...
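
To see the two byte representations for yourself, here is a small sketch using the core Encode module. The variable names are only illustrative, and the string is written with an escape so the result does not depend on how this file is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "Böck" as a character string; \x{f6} is U+00F6 (o with diaeresis).
my $chars = "B\x{f6}ck";

# Encode to bytes in each encoding and dump them as two-digit hex values.
my $utf8_hex   = join ' ', unpack '(H2)*', encode('UTF-8',      $chars);
my $latin1_hex = join ' ', unpack '(H2)*', encode('ISO-8859-1', $chars);

print "UTF-8:   $utf8_hex\n";
print "Latin-1: $latin1_hex\n";
```

The ö takes two bytes (c3 b6) in UTF-8 and a single byte (f6) in Latin-1; the other letters are identical in both.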

        Yes, as almut pointed out, my source is in UTF-8, so I do need the pragma.

        But the plot thickens:

        Going back to my original code, switching the "use encoding" for "use utf8" did not fix things. The original regular expression was much more complex, and it still dies. I've verified that even a tiny bit more complex RE will still fail even using "use utf8". It did seem a little "magical" that simply removing what should have been a harmless pragma made things work...

        The modified example follows; I ran on 5.12.1. What am I missing? Your sage help is much appreciated!

            #!/usr/bin/perl

            use strict vars;
            use utf8;
            binmode STDOUT, ":utf8";

            my $e = "Böck";
            if (utf8::is_utf8($e)) {
                print "yep, is UTF8: $e\n";
            }

            # this succeeds (failed before with use encoding 'utf8', unknown why)
            if ($e=~ m/.*?[x]$/) {
                print "matched simple\n";
            }
            print "success with simple\n";

            # these die
            if ($e=~ m/.*?\p{Space}$/) {
                print "matched medium\n";
            }
            print "success with medium\n";

            if ($e=~ m/.*?[xyz]$/) {
                print "matched medium\n";
            }
            print "success with medium\n";

            # the original, full expression. Naturally, this dies.
            if ($e =~ m/(.*?)[,\p{isSpace}]+((?:\p{isAlpha}[\p{isSpace}\.]{1,2})+)\p{isSpace}*$/) {
                print "matched complex\n";
            }
            print "success with complex\n";

      The code works for me without either the "use utf8"

      Except, say, if you took the length of the variable.
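
A short sketch of that length difference, using explicit byte escapes so it behaves the same however the file is saved. The first string stands in for what Perl sees without "use utf8" (five bytes); decoding it gives what Perl sees with the pragma (four characters).

```perl
use strict;
use warnings;
use Encode qw(decode);

# The five UTF-8 bytes of "Böck", as an undecoded byte string.
my $bytes = "B\xc3\xb6ck";

# The same text decoded into a four-character string.
my $chars = decode('UTF-8', $bytes);

printf "undecoded: length %d, utf8 flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
printf "decoded:   length %d, utf8 flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
```

Anything that counts or matches characters (length, substr, the "." in a regex) sees the two strings differently, even though they print the same bytes on a raw handle.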

Re: Odd problems with UTF-8, regexps, and newer Perl versions (/i)
by tye (Sage) on Jun 05, 2010 at 06:11 UTC

    Add /i and I get a more verbose (and non-fatal) error only if "use encoding" is commented out:

    Malformed UTF-8 character (unexpected continuation byte 0xb6, with no preceding start byte) in pattern match (m//)

    Which indicates that the regex engine is starting a step at the second byte of the multi-byte character.

    And the way the error comes and goes with seemingly unrelated changes makes me suspect there might be something like alignment or buffer overflow involved.
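
As an illustration of why 0xb6 is reported as a continuation byte, here is a small sketch that classifies each byte of the UTF-8 encoding of "Böck" by its high bits. The helper name utf8_byte_role is made up for this example, and it does not validate complete sequences.

```perl
use strict;
use warnings;

# Classify a single byte by its role in a UTF-8 sequence (hypothetical helper).
sub utf8_byte_role {
    my ($byte) = @_;
    return 'ASCII'        if $byte < 0x80;              # 0xxxxxxx
    return 'continuation' if ($byte & 0xC0) == 0x80;    # 10xxxxxx
    return 'start';                                     # 11xxxxxx, not validated further
}

# Walk the UTF-8 bytes of "Böck" one at a time.
for my $byte (unpack 'C*', "B\xc3\xb6ck") {
    printf "0x%02x  %s\n", $byte, utf8_byte_role($byte);
}
```

A decoder that begins reading at 0xb6 instead of at 0xc3 has skipped the start byte, which is the situation the warning text describes.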

    (Updated.)

    - tye        

Re: Odd problems with UTF-8, regexps, and newer Perl versions
by westrock2000 (Beadle) on Jun 05, 2010 at 11:38 UTC
    I don't know if this helps you, but here is how I got UTF-8 to work across systems using both 5.8 and 5.6.1 (and had to use uxterm on the 5.6.1 system to get xterm to display it correctly):

        #!/usr/bin/perl -w
        BEGIN{
            if ($] < 5.008){
                require utf8;
                utf8->import();
            }
        }
        if ($] >= 5.008){
            binmode STDOUT, 'utf8';
        }

    Basically, if the Perl version ($]) is below 5.8 it uses one method of setting UTF-8, and if it's equal to 5.8 or above it sets UTF-8 another way. I don't know how proper this is, but I was getting all kinds of trash whenever I tried to display Unicode (ISO-10646) on older Red Hat 7.3 in xterm. After about 2 days of surfing I came up with the combination of using that in the script and launching in uxterm... now Unicode displays properly on both types of systems.
