Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??

I used utf8::upgrade() as a pure Perl example, so that I wouldn't have to resort to Inline::C or XS and more people would be able to run the sample code.

Perl_sv_utf8_upgrade_flags_grow is one of those root functions that's invoked via many wrappers, though, like Perl_do_openn or Perl_sv_setsv_flags. There are many ways to get at it.

As noted, I discovered the issue via the SvPVutf8 XS macro. The devel branch of one of my CPAN distros, KinoSearch, is a mostly-C library which uses UTF-8 strings exclusively internally. Therefore, I use SvPVutf8 rather than SvPV for accessing string pointers from arguments.

If anybody ever uses $1 as an argument to any XS library function which uses SvPVutf8, it will get upgraded, triggering the bug:

$category =~ /(\w+)/ my $term_query = KinoSearch::Search::TermQuery->new( field => 'category', term => $1, );

Other libraries which use SvPVutf8 include Mail::SpamAssassin, Glib, Tk, etc. However, I suspect that the problem isn't limited to us. It's more that using m//g is a little esoteric, and many functions reset $1 by turning off the SVf_POK flag -- e.g. length($1) will do it. So the problem tends not to persist for very long -- but while it does, you can get some maddeningly subtle bugs!


In reply to Re^2: utf8::upgrade and $1 by creamygoodness
in thread utf8::upgrade and $1 by creamygoodness

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others cooling their heels in the Monastery: (7)
    As of 2015-07-05 19:51 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









      Results (67 votes), past polls