PerlMonks  

Re^2: Is there some universal Unicode+UTF8 switch?

by haj (Chaplain)
on Sep 02, 2019 at 09:01 UTC


in reply to Re: Is there some universal Unicode+UTF8 switch?
in thread Is there some universal Unicode+UTF8 switch?

A good compilation, to which I'd like to add a few items (items 6 and 7 may just expand on your fourth entry):

5) @ARGV and %ENV. utf8::all converts @ARGV but does not touch %ENV. Good luck on Windows, where by default the terminal doesn't use UTF-8 encoding.
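A minimal sketch of decoding environment values by hand, using only the core Encode module. The helper name decoded_env and the sample hash are made up for illustration, and the encoding you pass in is an assumption you have to verify for your own terminal (for example 'UTF-8' on most Unix systems, or a code page such as 'cp1252' on Windows).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return a decoded copy of an environment-like hash.
# $encoding is an assumption about what the shell or terminal really uses.
sub decoded_env {
    my ($env, $encoding) = @_;
    my %decoded;
    for my $name (keys %$env) {
        # Work on copies: decode() with a CHECK argument may modify its input.
        my ($key, $value) = ($name, $env->{$name});
        $decoded{ decode($encoding, $key, Encode::FB_CROAK) }
            = decode($encoding, $value, Encode::FB_CROAK);
    }
    return \%decoded;
}

# Sample data standing in for %ENV: the UTF-8 bytes of "Grüße".
my %sample = (GREETING => "Gr\xC3\xBC\xC3\x9Fe");
my $env    = decoded_env(\%sample, 'UTF-8');

printf "%d characters\n", length($env->{GREETING});   # 5 characters
```

In a real program you would pass \%ENV instead of the sample hash. FB_CROAK makes a wrong guess about the encoding fail loudly instead of silently producing replacement characters.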

6) Database fields. Encoding of these is often defined outside of the Perl world. The driver docs should tell you how to handle encoding.

7) Evaluating binary data in your program: unzipping compressed data, decrypting secret stuff, and parsing ASN.1 or image metadata may all return (encoded) text.
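To illustrate item 7, here is a small round trip through gzip using the core modules IO::Compress::Gzip and IO::Uncompress::Gunzip. The sample text is made up, and the sketch assumes the compressed payload is UTF-8 encoded, which in real data you would have to know from the format or its documentation.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Build some compressed test data: the text is encoded to UTF-8 bytes first.
my $text  = "sm\x{f8}rrebr\x{f8}d";          # 10 characters
my $bytes = encode('UTF-8', $text);
gzip(\$bytes => \my $compressed)
    or die "gzip failed: $GzipError";

# Unzipping returns bytes, not characters ...
gunzip(\$compressed => \my $unzipped)
    or die "gunzip failed: $GunzipError";

# ... so the result still has to be decoded before it is used as text.
my $decoded = decode('UTF-8', $unzipped);

printf "bytes: %d, characters: %d\n", length($unzipped), length($decoded);
# bytes: 12, characters: 10
```

The same pattern applies to decrypted data or parsed metadata: whatever comes out of the binary layer is bytes until you decode it with the encoding that the format prescribes.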

I'm also not too happy with using Devel::Peek for debugging encoding issues. It provides too much information, some of it useless or even misleading, and its output is difficult to read (compare PV = 0x5629d24aaa30 "\303\244"\0 [UTF8 "\x{e4}"] with PV = 0x5629d2414060 "\344"\0). I'd rather write suspicious strings to a file, using UTF-8 encoding, and examine that file with an editor capable of both UTF-8 and hex display.
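A sketch of that approach, with two hypothetical helpers: write_utf8 writes a string through a UTF-8 encoding layer, and codepoints lists the code points Perl sees in a string, as a quick alternative when no hex editor is at hand. Both names are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: list the code points Perl sees in a string.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $string;
}

# Hypothetical helper: write a string to a file through a UTF-8 layer.
sub write_utf8 {
    my ($path, $string) = @_;
    open my $fh, '>:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    print {$fh} $string;
    close $fh or die "Cannot close $path: $!";
}

my $decoded   = "\x{e4}";      # one character: a-umlaut
my $undecoded = "\xC3\xA4";    # the same a-umlaut as two undecoded UTF-8 bytes

print codepoints($decoded),   "\n";    # U+00E4
print codepoints($undecoded), "\n";    # U+00C3 U+00A4
```

Written with write_utf8, the decoded string ends up as the bytes C3 A4 in the file, while the undecoded one ends up double-encoded as C3 83 C2 A4, which is easy to spot in a hex view.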

I'm also using some regular expressions in debugging:

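    # Heuristic only: matches byte sequences that look like encoded UTF-8.
    my $utf8_decodable_regex = qr/
        [\xC2-\xDF][\x80-\xBF]    |   # 2-byte UTF-8 sequence
        [\xE0-\xEF][\x80-\xBF]{2} |   # 3-byte UTF-8 sequence
        [\xF0-\xF4][\x80-\xBF]{3}     # 4-byte UTF-8 sequence
    /x;

    # True if the string contains at least one UTF-8-like byte sequence.
    sub contains_decodable_utf8 {
        $_[0] =~ /$utf8_decodable_regex/;
    }

    # True if the whole string consists of ASCII and UTF-8-like sequences only.
    sub is_utf8_decodable {
        $_[0] =~ /\A(?:$utf8_decodable_regex|[[:ascii:]])*\z/;
    }
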
  • If contains_decodable_utf8($string) is false, then the string holds nothing that looks like encoded UTF-8 and you should be fine.
  • If is_utf8_decodable($string) is true, then you can (and should) decode the string.
  • If contains_decodable_utf8($string) is true but is_utf8_decodable($string) is false, then you either have binary data (which might be just fine) or you have already mixed up encodings. Go back in your code and check what you did to $string before.
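The three cases can be tried out on small byte strings. The sketch below restates the two helpers (with lead bytes limited to those valid in UTF-8) so that it runs on its own. The classify function and its return strings are made up for this example, and the checks remain a heuristic rather than a full UTF-8 validator.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $utf8_decodable_regex = qr/
    [\xC2-\xDF][\x80-\xBF]    |   # 2-byte UTF-8 sequence
    [\xE0-\xEF][\x80-\xBF]{2} |   # 3-byte UTF-8 sequence
    [\xF0-\xF4][\x80-\xBF]{3}     # 4-byte UTF-8 sequence
/x;

sub contains_decodable_utf8 {
    return $_[0] =~ /$utf8_decodable_regex/;
}

sub is_utf8_decodable {
    return $_[0] =~ /\A(?:$utf8_decodable_regex|[[:ascii:]])*\z/;
}

# Hypothetical helper: map the three cases to a short verdict.
sub classify {
    my ($string) = @_;
    return 'fine as is' unless contains_decodable_utf8($string);
    return 'decode it'  if is_utf8_decodable($string);
    return 'binary or mixed encodings';
}

print classify("\xE4"),          "\n";   # fine as is
print classify("\xC3\xA4"),      "\n";   # decode it
print classify("\xC3\xA4 \xE4"), "\n";   # binary or mixed encodings

# In the second case, decoding turns the two bytes into one character.
my $char = decode('UTF-8', "\xC3\xA4");
printf "U+%04X\n", ord $char;            # U+00E4
```

The third sample mixes a UTF-8 encoded a-umlaut with a Latin-1 encoded one, which is the typical result of concatenating decoded and undecoded data somewhere upstream.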
