Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister

utf8::upgrade and $1

by creamygoodness (Curate)
on Aug 30, 2009 at 06:15 UTC ( #792157=perlquestion: print w/ replies, xml ) Need Help??
creamygoodness has asked for the wisdom of the Perl Monks concerning the following question:


In the Perl core, there's a function called Perl_sv_utf8_upgrade_flags_grow. Many paths ultimately lead to this function; I got there by way of the XS macro SvPVutf8, but the easiest way invoke it from Perl-space is via utf8::upgrade.

It turns out that this function doesn't play nice with capture variables like $1. After it is invoked on them, they no longer capture properly:

marvin@smokie:~/perltest $ bleadperl Without utf8::upgrade... a b c With utf8::upgrade... a a a Problem persists... a a a marvin@smokie:~/perltest $

Here's the test script:

use strict; use warnings; my $text = "a b c"; print "Without utf8::upgrade...\n"; while ( $text =~ /(\S)/g ) { print "$1\n"; } print "With utf8::upgrade...\n"; while ( $text =~ /(\S)/g ) { print "$1\n"; utf8::upgrade($1); } print "Problem persists...\n"; my $more_text = "d e f"; while ( $more_text =~ /(\S)/g ) { print "$1\n"; }

The problem appears to be that Perl_sv_utf8_upgrade_flags_grow turns on the SVf_POK flag by way of SvPV_force. It doesn't seem to have anything to do with whether the SVf_UTF8 flag is on in either the string being regexed or the capture variable itself.

Applying the following patch to sv.c in blead appears to kill the bug at the source. All of Perl's test cases still pass after it is applied.

marvin@smokie:~/projects/perl-git $ git diff diff --git a/sv.c b/sv.c index a53669a..280f064 100644 --- a/sv.c +++ b/sv.c @@ -3231,6 +3231,8 @@ Perl_sv_utf8_upgrade_flags_grow(pTHX_ register S +V *const sv, const I32 flags, if (extra) SvGROW(sv, SvCUR(sv) + extra); return len; } + } else if (SvGMAGICAL(sv) && mg_find(sv, PERL_MAGIC_sv)) { + ; } else { (void) SvPV_force(sv,len); }

I'd like to solve this issue and supply a working patch via perlbug, so I can say that I solved a UTF-8 bug in the Perl core. :) However, I'm not sure that this patch is legit, because I don't understand exactly what PERL_MAGIC_sv is all about.

I think what's going on is that $1 and friends are magical variables that should never have the SVf_POK flag on, since that indicates that they contain real strings. The regex engine probably doesn't use the standard string assignment interface and goes through the magic interface instead, hence its work is no longer visible once the standard channel is open. But is it safe to have the SVf_POK flag off for the remainder of Perl_sv_utf8_upgrade_flags_grow? SvPV_force was probably called for a reason, after all.

Comment on utf8::upgrade and $1
Select or Download Code
Re: utf8::upgrade and $1
by Anonymous Monk on Aug 30, 2009 at 06:58 UTC
    I'm 99% sure you're never supposed to call utf8::upgrade on $1

      I used utf8::upgrade() as a pure Perl example, so that I wouldn't have to resort to Inline::C or XS and more people would be able to run the sample code.

      Perl_sv_utf8_upgrade_flags_grow is one of those root functions that's invoked via many wrappers, though, like Perl_do_openn or Perl_sv_setsv_flags. There are many ways to get at it.

      As noted, I discovered the issue via the SvPVutf8 XS macro. The devel branch of one of my CPAN distros, KinoSearch, is a mostly-C library which uses UTF-8 strings exclusively internally. Therefore, I use SvPVutf8 rather than SvPV for accessing string pointers from arguments.

      If anybody ever uses $1 as an argument to any XS library function which uses SvPVutf8, it will get upgraded, triggering the bug:

      $category =~ /(\w+)/ my $term_query = KinoSearch::Search::TermQuery->new( field => 'category', term => $1, );

      Other libraries which use SvPVutf8 include Mail::SpamAssassin, Glib, Tk, etc. However, I suspect that the problem isn't limited to us. It's more that using m//g is a little esoteric, and many functions reset $1 by turning off the SVf_POK flag -- e.g. length($1) will do it. So the problem tends not to persist for very long -- but while it does, you can get some maddeningly subtle bugs!

        I think that using $1, $_, $@, and friends as arguments to external methods/subs is always a bad idea. You just don't know what's going to be done in between regardless of issues like the one you found. So I'd say documenting the issue is all that's necessary.

        By the way, I think KinoSearch is fantastic. I can't thank you enough for doing it.

Re: utf8::upgrade and $1
by ikegami (Pope) on Aug 31, 2009 at 03:43 UTC

    The fix doesn't seem correct. Why would you want it to be a no-op for magical variables? Not even a warning!

    Seems to me it should do a mg_get initially, upgrade the string, then a mg_set after the upgrading (taking care not to set SVf_POK).

    It seems someone started to do it this way, given the presence of sv_utf8_upgrade_nomg. You might want to find the story behind that.

    If you go down that route, the following fix is necessary in sv.h:

    -#define sv_utf8_upgrade(sv) sv_utf8_upgrade_flags(sv, SV_GMAGIC) +#define sv_utf8_upgrade(sv) sv_utf8_upgrade_flags(sv, SV_GMAGIC|SV_SM +AGIC)

    If not, that line still needs fixing.

      I agree, the fix is not finished yet. I suspect that the patch works because the rest of the function "mistakes" the private string for a public string, so to speak, and that the test suite passes because either the function is never invoked on a capture variable or because no variable with PERL_MAGIC_sv passes through that function which lacks a private string. Note that the patch doesn't affect all magic variables, just those with PERL_MAGIC_sv -- which is only one type among many, despite the name. From perl.h:

      #define PERL_MAGIC_sv '\0' /* Special scalar variable */ #define PERL_MAGIC_overload 'A' /* %OVERLOAD hash */ #define PERL_MAGIC_overload_elem 'a' /* %OVERLOAD hash element */ ...

      Nevertheless, it's useful to have isolated the problematic section of code.

      I managed to hunt down some prior discussion:

      Nicholas Clark had this to say:

      POK shouldn't be on for something with GMG. It should only be pPOK, so that any calls to SvPV() etc call into Perl_sv2pv_flags(), and the get magic is triggered.

      (GMG is "get magic".) Based on Nick's post, I think we can confirm that the fact that $1 leaves this function with the SVf_POK flag on is a bug.

      demerphq had this to say with regards to why $1 doesn't carry the READONLY flag:

      Turns out that this is deliberate, as some pluggable regex engines allow for a writable $1.
      Aevar patched this in quite some time ago.
      So I think at this points its a bug in Encode, but I couldn't figure out how to fix it in the time I had available.

      Perhaps we have tracked down this "bug in Encode".

      Incidentally, this bug is a regession in Perl 5.10. Perl 5.8.8 doesn't have this problem.

        I think we can confirm that the fact that $1 leaves this function with the SVf_POK flag on is a bug.


        demerphq had this to say with regards to why $1 doesn't carry the READONLY flag:

        Not a problem for my solution. The setter called via mg_set after the upgrade will throw an error if appropriate.

        Turns out that this is deliberate, as some pluggable regex engines allow for a writable $1.

        That means that upgrading $1 might have no effect or might upgrade the entire source string in those engines, but I don't see that as a problem. Side-effects are expectedintended when playing with magical variables.

        Incidentally, this bug is a regession in Perl 5.10. Perl 5.8.8 doesn't have this problem.

        That would explain the presence of sv_utf8_upgrade_nomg.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://792157]
Approved by AnomalousMonk
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (7)
As of 2014-07-24 11:16 GMT
Find Nodes?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:

    Results (160 votes), past polls