PerlMonks  

Perl's strings are formally lists of codepoints. So the following applies:
    @codepoints = map ord($_), split //, $s1;
    $s2 = join '', map chr($_), @codepoints;
    ok($s1 eq $s2);
    ok(length($s1) == length($s2));
How perl internally stores those codepoints is up to perl, and perl-level code mostly needn't care about the difference. XS code, on the other hand, needs to know about it if it's going to start rummaging around accessing the individual bytes making up the string's storage. Currently perl uses two storage formats - the traditional one byte per codepoint, and a variable-length utf8 encoding. The encoding in use is indicated by the SVf_UTF8 flag. You can't guarantee which encoding will be used - that's up to perl. For example, at the moment:
    $s1 = "abc\x80";
    $s2 = $s1;         # currently SVf_UTF8 not set; string uses 4 bytes of storage
    $s2 .= "\x{100}";  # currently perl upgrades to SVf_UTF8 and converts the 0x80 and 0x100 into multi-byte representations
    chop($s2);         # currently perl doesn't downgrade; the 0x80 codepoint still stored as 2 bytes
    ok($s1 eq $s2);
    ok(length($s1) == length($s2));
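The sequence above can be watched from perl code. This is only a sketch: the `storage` helper is a hypothetical name, and it relies on `utf8::is_utf8()` to report the SVf_UTF8 flag and `bytes::length()` to report the size of the internal buffer. The flag and byte counts it prints depend on the perl in use; only the codepoint-level comparisons at the end are guaranteed.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length() without enabling byte semantics

# Hypothetical helper: describe how perl currently stores a string.
# Uses $_[0] directly so that no copy is made before inspecting it.
sub storage {
    return sprintf "utf8=%d chars=%d bytes=%d",
        (utf8::is_utf8($_[0]) ? 1 : 0),
        length($_[0]),
        bytes::length($_[0]);
}

my $s1 = "abc\x80";
my $s2 = $s1;
print "s1 initially:   ", storage($s1), "\n";

$s2 .= "\x{100}";    # append a codepoint > 0xff
print "s2 after .=:    ", storage($s2), "\n";

chop($s2);           # remove it again
print "s2 after chop:  ", storage($s2), "\n";

# Whatever the storage looks like, the strings are the same at the perl level.
print "same codepoints: ", ($s1 eq $s2 ? "yes" : "no"), "\n";
print "same length:     ", (length($s1) == length($s2) ? "yes" : "no"), "\n";
```

The internal details are for diagnostics only; nothing at the perl level should branch on them.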
utf8::upgrade() and utf8::downgrade() are just ways of forcing the encoding format of the internal representation - useful for demonstrating bugs in modules which make assumptions. Note that they don't change the semantics of the string - perl thinks they still have the same length and codepoints. To continue the example above:
    utf8::downgrade($s2);  # the 0x80 codepoint now stored as 1 byte
    ok(length($s1) == length($s2));
    ok($s1 eq $s2);
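As a sketch of the "demonstrating bugs" use, the following forces one string into both representations and feeds each to a function that wrongly depends on the internal bytes. The names `both_forms` and `naive_byte_count` are hypothetical; the byte-counting function stands in for any module code that makes that assumption.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length() without enabling byte semantics

# Hypothetical helper: return two copies of a string that are equal at the
# perl level but forced into the two internal representations.
sub both_forms {
    my ($s) = @_;
    my ($down, $up) = ($s, $s);
    utf8::upgrade($up);           # force the utf8 representation
    utf8::downgrade($down, 1);    # force one byte per codepoint; the true
                                  # second argument means "don't die on failure"
    return ($down, $up);
}

# Stand-in for buggy code: it looks at the internal buffer, not the codepoints.
sub naive_byte_count {
    return bytes::length($_[0]);
}

my ($down, $up) = both_forms("caf\xe9");

print "equal at perl level\n"
    if $down eq $up && length($down) == length($up);
print "naive_byte_count disagrees with itself\n"
    if naive_byte_count($down) != naive_byte_count($up);
```

Running a module's test inputs through both forms is a cheap way to find out whether it depends on the representation.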
What an XS module does when it wants to process a list of bytes is of course up to the XS module's author. However, just using the current bytes of the internal representation is probably a poor choice - two strings which are semantically identical at the perl level but have different internal representations will give different results (e.g. the $s1 above initially, and the $s2 after the chop()). If there is no sensible interpretation of codepoints > 0xff, then I would suggest the XS code should check the SVf_UTF8 flag and, if it is set, try to downgrade the string, and croak if that isn't possible.
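The check-downgrade-croak policy can be sketched at the perl level as below; real XS code would do the equivalent in C. The function name `bytes_or_croak` is hypothetical. It assumes the documented behaviour of `utf8::downgrade()`, which returns false rather than dying when given a true second argument.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper: return a copy of the string stored one byte per
# codepoint, or croak if it contains a codepoint that doesn't fit in a byte.
sub bytes_or_croak {
    my ($s) = @_;    # work on a copy so the caller's scalar is left alone
    if (utf8::is_utf8($s)) {
        utf8::downgrade($s, 1)
            or croak "string contains codepoints > 0xff";
    }
    return $s;
}

my $s1 = "abc\x80";
my $s2 = $s1;
$s2 .= "\x{100}";
chop($s2);

# Both strings now yield the same bytes, however perl happened to store them.
print "same bytes\n" if bytes_or_croak($s1) eq bytes_or_croak($s2);

# A codepoint > 0xff has no byte interpretation, so this croaks.
my $ok = eval { bytes_or_croak("\x{100}"); 1 };
print "croaked: $@" unless $ok;
```

At the XS level, perlapi documents SvPVbyte as performing a comparable downgrade-or-croak step, though check the documentation for your perl version.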

Dave.


In reply to Re: What does utf8::upgrade actually do. by dave_the_m
in thread What does utf8::upgrade actually do. by syphilis
