I used utf8::upgrade() as a pure Perl example, so that I wouldn't have to resort to Inline::C or XS and more people would be able to run the sample code.
Perl_sv_utf8_upgrade_flags_grow is one of those root functions that's invoked via many wrappers, though, like Perl_do_openn or Perl_sv_setsv_flags. There are many ways to get at it.
As noted, I discovered the issue via the SvPVutf8 XS macro. The devel branch of one of my CPAN distros, KinoSearch, is a mostly-C library which uses UTF-8 strings exclusively internally. Therefore, I use SvPVutf8 rather than SvPV for accessing string pointers from arguments.
If anybody ever uses $1 as an argument to any XS library function which uses SvPVutf8, it will get upgraded, triggering the bug:
$category =~ /(\w+)/;
my $term_query = KinoSearch::Search::TermQuery->new(
    field => 'category',
    term  => $1,
);
Other libraries which use SvPVutf8 include Mail::SpamAssassin, Glib, Tk, etc. However, I suspect that the problem isn't limited to us. It's more that using m//g is a little esoteric, and many functions reset $1 by turning off the SVf_POK flag -- e.g. length($1) will do it. So the problem tends not to persist for very long -- but while it does, you can get some maddeningly subtle bugs!
I think that using $1, $_, $@, and friends as arguments to external methods/subs is always a bad idea. You just don't know what's going to be done in between regardless of issues like the one you found. So I'd say documenting the issue is all that's necessary.
By the way, I think KinoSearch is fantastic. I can't thank you enough for doing it.
As a practical matter, I think your recommendations to end users regarding
using special variables as arguments are sound advice. The same holds true
for variables which are overloaded, tied, and so on. Partly this is because
XS modules have options with regard to how they treat arguments, and it's
hard to get everything right.
In this case, however, I believe that the problem is both contained and
solvable. If I'm right, $1 should never have its SVf_POK flag set -- it will
always have SVp_POK set, indicating that it has a valid "private pointer", but
never the SVf_POK flag. From sv.h:
#define SVf_IOK 0x00000100 /* has valid public integer value */
#define SVf_NOK 0x00000200 /* has valid public numeric value */
#define SVf_POK 0x00000400 /* has valid public pointer value */
#define SVf_ROK 0x00000800 /* has a valid reference pointer */
#define SVp_IOK 0x00001000 /* has valid non-public integer value */
#define SVp_NOK 0x00002000 /* has valid non-public numeric value */
#define SVp_POK 0x00004000 /* has valid non-public pointer value */
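To make the public/private distinction concrete, here is a minimal standalone sketch. It does not use the real perl headers: toy_sv, toy_pok and toy_pokp are hypothetical names invented for illustration, and only the two flag values are taken from the sv.h excerpt. In real XS code the corresponding tests would be the SvPOK and SvPOKp macros.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from the sv.h excerpt quoted in this post. */
#define SVf_POK 0x00000400 /* public string flag */
#define SVp_POK 0x00004000 /* private (non-public) string flag */

/* Hypothetical stand-in for an SV; a real SV carries far more state. */
typedef struct {
    uint32_t flags;
} toy_sv;

/* Tests the public flag only (the role SvPOK plays in real code). */
static bool toy_pok(const toy_sv *sv)
{
    return (sv->flags & SVf_POK) != 0;
}

/* Tests the private flag only (the role SvPOKp plays in real code). */
static bool toy_pokp(const toy_sv *sv)
{
    return (sv->flags & SVp_POK) != 0;
}
```

If my reading is right, a healthy $1 would look like a toy_sv whose flags are just SVp_POK: toy_pokp is true and toy_pok is false. The buggy state is the one where both are true.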
The task is thus to identify any such variables within
Perl_sv_utf8_upgrade_flags_grow and ensure that the SVf_POK flag
is off when the function returns. That can be achieved either by never
turning it on in the first place, by turning it off at some point, or by
throwing an exception.
It may actually be important to ensure that the flag never gets turned on.
It's not clear to me that it's valid to call SvPV_force on $1. Should
the attempt trigger a "modification of readonly value" exception?
The questions I would like answers to are:
How are scalars with PERL_MAGIC_sv magic different from ordinary
Are there any scalars other than the capture values which are assigned
Does every scalar with PERL_MAGIC_sv magic have the SVp_POK
Can we use this "private" buffer in place of the standard string buffer
for the purposes of Perl_sv_utf8_upgrade_flags_grow, and if so, are
there any actions we need to take to ensure its safety?
Based on the answers to those questions, we should be able to come up with
the proper incantation -- either at the beginning or near the end of the
function -- to ensure that $1 leaves with its SVf_POK flag unset.
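As a rough illustration of the "turn it off near the end" option, here is a toy model of the flag bookkeeping only. Everything in it is an assumption: toy_sv, its is_capture field (a stand-in for whatever the real PERL_MAGIC_sv test turns out to be), toy_upgrade_body and toy_upgrade are hypothetical, and the body merely imitates the suspected bug by switching the public flag on. Whether this is actually safe inside Perl_sv_utf8_upgrade_flags_grow depends on the answers to the questions above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from the sv.h excerpt quoted earlier. */
#define SVf_POK 0x00000400 /* public string flag */
#define SVp_POK 0x00004000 /* private (non-public) string flag */

/* Hypothetical stand-in for an SV. */
typedef struct {
    uint32_t flags;
    bool     is_capture; /* assumed marker for $1-style magical scalars */
} toy_sv;

/* Imitates the suspected bug: the upgrade leaves both flags on. */
static void toy_upgrade_body(toy_sv *sv)
{
    sv->flags |= SVf_POK | SVp_POK;
}

/* Wrapper that guarantees a capture scalar leaves with SVf_POK unset,
 * while ordinary scalars keep whatever flags the upgrade gave them. */
static void toy_upgrade(toy_sv *sv)
{
    toy_upgrade_body(sv);
    if (sv->is_capture)
        sv->flags &= ~(uint32_t)SVf_POK; /* private flag is left alone */
}
```

The same guarantee could instead be made at the top of the function by never setting the flag for such scalars, which may be the better choice if turning it on even briefly is invalid.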