In the Perl core, there's a function called Perl_sv_utf8_upgrade_flags_grow. Many paths ultimately lead to this function; I got there by way of the XS macro SvPVutf8, but the easiest way to invoke it from Perl-space is via utf8::upgrade.
It turns out that this function doesn't play nice with capture variables like $1. After it is invoked on them, they no longer capture properly:
Here's the test script:
The problem appears to be that Perl_sv_utf8_upgrade_flags_grow turns on the SVf_POK flag by way of SvPV_force. It doesn't seem to have anything to do with whether the SVf_UTF8 flag is on in either the string being regexed or the capture variable itself.
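One way to check the flag theory from Perl-space is to look at the SV's flags through the core B module. What follows is only a rough sketch of that idea, not the test script mentioned above: the helper name pok_is_on and the sample strings are made up for illustration, and what it prints will depend on the perl it runs under, so no expected output is given.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: report whether SVf_POK is set on the referenced scalar.
sub pok_is_on {
    my ($ref) = @_;
    return ( B::svref_2object($ref)->FLAGS & B::SVf_POK() ) ? 1 : 0;
}

"first" =~ /(.+)/ or die "no match";
printf "before upgrade:  POK=%d\n", pok_is_on(\$1);

# Wrapped in eval in case this perl refuses to modify $1.
eval { utf8::upgrade($1); 1 } or print "utf8::upgrade died: $@";
printf "after upgrade:   POK=%d\n", pok_is_on(\$1);

"second" =~ /(.+)/ or die "no match";
# Check the flag before reading $1, since reading it triggers get-magic.
my $pok  = pok_is_on(\$1);
my $seen = defined $1 ? $1 : '(undef)';
printf "after new match: POK=%d, \$1=%s\n", $pok, $seen;
```

If the theory holds, the POK line should flip after utf8::upgrade and $1 should still show the old capture after the second match. Devel::Peek's Dump($1) would show the same flags in more detail.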
Applying the following patch to sv.c in blead appears to kill the bug at the source. All of Perl's test cases still pass after it is applied.
I'd like to solve this issue and supply a working patch via perlbug, so I can say that I solved a UTF-8 bug in the Perl core. :) However, I'm not sure that this patch is legit, because I don't understand exactly what PERL_MAGIC_sv is all about.
I think what's going on is that $1 and friends are magical variables that should never have the SVf_POK flag on, since that indicates that they contain real strings. The regex engine probably doesn't use the standard string assignment interface and goes through the magic interface instead, hence its work is no longer visible once the standard channel is open. But is it safe to have the SVf_POK flag off for the remainder of Perl_sv_utf8_upgrade_flags_grow? SvPV_force was probably called for a reason, after all.