<?xml version="1.0" encoding="windows-1252"?>
<node id="792157" title="utf8::upgrade and $1" created="2009-08-30 02:15:54" updated="2009-08-30 02:15:54">
<type id="115">
perlquestion</type>
<author id="384960">
creamygoodness</author>
<data>
<field name="doctext">
&lt;p&gt;Greets,&lt;/p&gt;

&lt;p&gt;In the Perl core, there's a function called
&lt;c&gt;Perl_sv_utf8_upgrade_flags_grow&lt;/c&gt;.
Many paths ultimately lead to this function; I got there by way of
the XS macro &lt;c&gt;SvPVutf8&lt;/c&gt;, but the easiest way to invoke it from Perl-space is
via &lt;c&gt;utf8::upgrade&lt;/c&gt;.&lt;/p&gt;

&lt;p&gt;It turns out that this function doesn't play nice with capture variables like
&lt;c&gt;$1&lt;/c&gt;.  After it is invoked on them, they no longer capture properly:&lt;/p&gt;

&lt;p&gt;&lt;c&gt;
marvin@smokie:~/perltest $ bleadperl dollar_one_utf8_upgrade.pl 
Without utf8::upgrade...
a
b
c
With utf8::upgrade...
a
a
a
Problem persists...
a
a
a
marvin@smokie:~/perltest $ 
&lt;/c&gt;
&lt;/p&gt;

&lt;readmore&gt;
&lt;p&gt;
Here's the test script:
&lt;/p&gt;

&lt;p&gt;
&lt;c&gt;
use strict;
use warnings;

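my $text = "a b c";

# Baseline: each iteration sees a fresh capture in $1.
print "Without utf8::upgrade...\n";
while ( $text =~ /(\S)/g ) {
    print "$1\n";
}

# Upgrading $1 in place freezes it at the first capture.
print "With utf8::upgrade...\n";
while ( $text =~ /(\S)/g ) {
    print "$1\n";
    utf8::upgrade($1);
}

# A different string and a fresh match loop still print the stale value.
print "Problem persists...\n";
my $more_text = "d e f";
while ( $more_text =~ /(\S)/g ) {
    print "$1\n";
}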

&lt;/c&gt;
&lt;/p&gt;

&lt;p&gt;The problem appears to be that &lt;c&gt;Perl_sv_utf8_upgrade_flags_grow&lt;/c&gt; turns
on the &lt;c&gt;SVf_POK&lt;/c&gt; flag by way of &lt;c&gt;SvPV_force&lt;/c&gt;.  It doesn't seem to
have anything to do with whether the &lt;c&gt;SVf_UTF8&lt;/c&gt; flag is on in either the
string being regexed or the capture variable itself.&lt;/p&gt;

&lt;p&gt;Applying the following patch to sv.c in blead appears to kill the bug at the
source.  All of Perl's test cases still pass after it is applied.&lt;/p&gt;

&lt;p&gt;
&lt;c&gt;
marvin@smokie:~/projects/perl-git $ git diff
diff --git a/sv.c b/sv.c
index a53669a..280f064 100644
--- a/sv.c
+++ b/sv.c
@@ -3231,6 +3231,8 @@ Perl_sv_utf8_upgrade_flags_grow(pTHX_ register SV *const sv, const I32 flags, 
                if (extra) SvGROW(sv, SvCUR(sv) + extra);
                return len;
            }
+        } else if (SvGMAGICAL(sv) &amp;&amp; mg_find(sv, PERL_MAGIC_sv)) {
+            ;
        } else {
            (void) SvPV_force(sv,len);
        }
&lt;/c&gt;
&lt;/p&gt;

&lt;p&gt;I'd like to solve this issue and supply a working patch via perlbug, so I can
say that I solved a UTF-8 bug in the Perl core.  :)  However, I'm not sure
that this patch is legit, because I don't understand exactly what
&lt;c&gt;PERL_MAGIC_sv&lt;/c&gt; is all about.&lt;/p&gt;

&lt;p&gt;I think what's going on is that $1 and friends are magical variables that
should never have the &lt;c&gt;SVf_POK&lt;/c&gt; flag on, since that indicates that they
contain real strings.  The regex engine probably doesn't use the standard string assignment interface and goes through the magic interface instead, so presumably once &lt;c&gt;SVf_POK&lt;/c&gt; is on, readers pick up the stale string buffer and never see the fresh capture.  But is it safe to have the &lt;c&gt;SVf_POK&lt;/c&gt; flag off for
the remainder of &lt;c&gt;Perl_sv_utf8_upgrade_flags_grow&lt;/c&gt;?  &lt;c&gt;SvPV_force&lt;/c&gt;
was probably called for a reason, after all.&lt;/p&gt;
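
&lt;p&gt;In the meantime, here is a minimal workaround sketch from Perl-space, assuming the
trouble really is confined to upgrading the capture variable in place: copy
&lt;c&gt;$1&lt;/c&gt; into a lexical and upgrade the copy.  The helper name
&lt;c&gt;upgraded_captures&lt;/c&gt; is made up for illustration; it is not part of any module.
This sidesteps the bug, it doesn't fix it.&lt;/p&gt;

```perl
use strict;
use warnings;

# Hypothetical helper (illustration only): collect every non-whitespace
# capture from $text, upgrading a copy of each one instead of $1 itself.
sub upgraded_captures {
    my ($text) = @_;
    my @captures;
    while ( $text =~ /(\S)/g ) {
        my $copy = $1;           # ordinary scalar, no capture magic attached
        utf8::upgrade($copy);    # only the copy is touched; $1 is left alone
        push @captures, $copy;
    }
    return @captures;
}

print join( " ", upgraded_captures("a b c") ), "\n";
print join( " ", upgraded_captures("d e f") ), "\n";
```

&lt;p&gt;Because &lt;c&gt;$1&lt;/c&gt; is only ever read, later matches should keep capturing
normally, and each copy still comes back with the UTF-8 flag on.&lt;/p&gt;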


</field>
</data>
</node>
