PerlMonks  

Re: What does utf8::upgrade actually do.

by Tux (Canon)
on Feb 18, 2021 at 09:21 UTC ( [id://11128513]=note )


in reply to What does utf8::upgrade actually do.

Also keep in mind that a perl "string" does not need to use a single encoding for all of its content.

Think XML and CSV where parts can be real binary and parts can be encoded.

Upgrading or downgrading the complete string before processing (whether in pure perl or in XS) can corrupt such data.
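To make the mixed-content point concrete, here is a minimal sketch. The record layout (a UTF-8 encoded name, a comma, then a binary blob) and all variable names are made up for illustration. It shows two things: decoding the whole record damages the binary part, and utf8::upgrade keeps the characters but changes the internal buffer, which matters to any code that looks at the raw bytes, as XS code might do if it ignores the UTF8 flag.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use Encode qw( encode decode );

# Hypothetical record: UTF-8 encoded text, a separator, raw binary
my $name_bytes = "\xc3\xaf";           # "i with diaeresis" encoded as UTF-8
my $blob       = "\xff\xfe\x00\x80";   # arbitrary binary, not valid UTF-8
my $record     = $name_bytes . "," . $blob;

# Treating the whole record as text: malformed sequences in the blob are
# replaced while decoding, so encoding again does not give the bytes back
my $copy       = $record;
my $round_trip = encode ("utf-8", decode ("utf-8", $copy));
print "whole-record round trip ",
    ($round_trip eq $record ? "kept" : "changed"), " the bytes\n";

# Splitting first and decoding only the part known to be text
my ($raw_name, $raw_blob) = split m/,/ => $record, 2;
my $name = decode ("utf-8", $raw_name);
print "name: ", length ($name), " character(s), ",
      "blob: ", length ($raw_blob), " byte(s), untouched\n";

# utf8::upgrade keeps the characters, but the internal buffer grows
my $upgraded = $blob;
utf8::upgrade ($upgraded);
my $internal = do { use bytes; length $upgraded };
print "upgraded blob: ", length ($upgraded), " characters, ",
      "$internal bytes in the internal buffer\n";
```

In this sketch the upgraded blob still compares equal to the original inside perl; the difference only shows up for code that reads the buffer without honouring the flag.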

One more thing to keep in mind with codepoints is that Unicode allows a lot.

e.g. U+1E2F (LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE) can be written in UTF-8 as e1 b8 af, c3 af cc 81, c3 ad cc 88, 69 cc 81 cc 88, or 69 cc 88 cc 81, all of which render as the same glyph. At the time of writing, perl does not alter any of these sequences by itself, but once Unicode Normalization rules are applied on a semantic level, your world view changes:

(update: I have since learned that the order of the diacritics can be meaningful, which is why two of the examples below do not normalize to U+1E2F)

#!/usr/bin/perl

use 5.18.2;
use warnings;

use Data::Peek;
use Unicode::Normalize qw( normalize );
use Encode qw( encode decode );
use charnames qw(:full);

sub dp {
    my ($tag, $dta) = @_;
    my $dp = DPeek ($dta);
    printf "%-6s: %-52s", $tag, $dp =~ s{^(\S+)\K}{" " x (26 - length $1)}er;
    utf8::is_utf8 ($dta) and
        print join " + " => map { charnames::viacode (ord) } split // => $dta;
    say "";
    } # dp

$| = 1;
foreach my $bytes (
        "\xe1\xb8\xaf",
        "\xc3\xaf\xcc\x81",
        "\xc3\xad\xcc\x88",
        "\x69\xcc\x81\xcc\x88",
        "\x69\xcc\x88\xcc\x81",
        ) {
    my $u = decode ("utf-8", $bytes);
    dp ("Bytes", $bytes);
    dp ("UTF-8", $u);
    dp ("NF$_", normalize ($_, $u)) for qw( D C KD KC );
    say "";
    }

->

Bytes : PV("\341\270\257"\0)
UTF-8 : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
NFD   : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFC   : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
NFKD  : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFKC  : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE

Bytes : PV("\303\257\314\201"\0)
UTF-8 : PV("\303\257\314\201"\0)   [UTF8 "\x{ef}\x{301}"]   LATIN SMALL LETTER I WITH DIAERESIS + COMBINING ACUTE ACCENT
NFD   : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFC   : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
NFKD  : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFKC  : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE

Bytes : PV("\303\255\314\210"\0)
UTF-8 : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS
NFD   : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFC   : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS
NFKD  : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFKC  : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS

Bytes : PV("i\314\201\314\210"\0)
UTF-8 : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFD   : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFC   : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS
NFKD  : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFKC  : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS

Bytes : PV("i\314\210\314\201"\0)
UTF-8 : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFD   : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFC   : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
NFKD  : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFKC  : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE

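A practical consequence of the output above: a plain eq compares code points, not meaning. If two strings should count as equal whenever they are canonically equivalent, one possible approach is to normalize both sides before comparing. The sketch below uses only the core Unicode::Normalize module; the helper name canon_eq is made up for this example and is not part of any module.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use Unicode::Normalize qw( NFC );

my $precomposed = "\x{1e2f}";          # single code point
my $dia_acute   = "i\x{308}\x{301}";   # i + diaeresis + acute
my $acute_dia   = "i\x{301}\x{308}";   # i + acute + diaeresis

# Hypothetical helper: compare two character strings after NFC
sub canon_eq {
    my ($x, $y) = @_;
    return NFC ($x) eq NFC ($y);
    } # canon_eq

my $raw_eq    = $precomposed eq $dia_acute;
my $same      = canon_eq ($precomposed, $dia_acute);
my $different = canon_eq ($precomposed, $acute_dia);

printf "plain eq, precomposed vs diaeresis+acute : %s\n", $raw_eq    ? "equal" : "different";
printf "NFC eq,   precomposed vs diaeresis+acute : %s\n", $same      ? "equal" : "different";
printf "NFC eq,   precomposed vs acute+diaeresis : %s\n", $different ? "equal" : "different";
```

Whether NFC or NFD is the better choice depends on what the rest of the program expects; the only requirement for the comparison is that both sides use the same form.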
Enjoy, Have FUN! H.Merijn
