PerlMonks
Re: My UTF-8 text isn't surviving I/O as expected

by cavac (Prior)
on Nov 25, 2024 at 13:47 UTC [id://11162881]


in reply to My UTF-8 text isn't surviving I/O as expected

I decided to join the 21st century

Technically, you decided to finally join the late 20th century. The groundwork for Unicode was laid in 1989, and UTF-8 was first presented to the public at the USENIX conference in 1993. (Sorry, couldn't resist.)

So, welcome to the international club of pain and suffering... uh, I meant to write "supporters of umlauts, Linear A(¹), hidden control characters that will confuse your text renderer(²), and black Santas(³). Also of multiple ways of encoding the same character, with the same text length but different byte lengths, that are still the same character yet need special (and complicated) functions to string-compare them(4). And, apparently, of broken superscripts on PerlMonks(5)."


(¹) Linear A

(²) Unicode control characters

(³) Emoji modifiers and examples in color

(4) Unicode equivalence, also incorrect length of strings with diphthongs

(5) PerlMonks seems to display only the superscripts ¹²³ correctly (I only tried it in post preview), but it should really support all the numbers and signs in the Unicode block "Superscripts and Subscripts".
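Footnote (4) in a minimal Perl sketch, using the core Unicode::Normalize module: the precomposed and decomposed spellings of "é" compare unequal until both sides are normalized to the same form, and length() counts code points rather than visible glyphs.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $precomposed = "\x{00E9}";    # "é" as a single code point (U+00E9)
my $decomposed  = "e\x{0301}";   # "e" plus combining acute accent (U+0301)

# Plain eq compares code points, so the two spellings look different ...
print $precomposed eq $decomposed ? "same\n" : "different\n";            # different

# ... but after normalizing both sides to the same form (NFC here), they match.
print NFC($precomposed) eq NFC($decomposed) ? "same\n" : "different\n";  # same

# length() reports code points, not rendered glyphs:
print length($decomposed), "\n";   # 2, even though it renders as one character
```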

PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP
Also check out my sister's artwork and my weekly webcomics

Re^2: My UTF-8 text isn't surviving I/O as expected
by choroba (Cardinal) on Nov 25, 2024 at 15:24 UTC
    For me, ⁵ works without problems. Isn't it a browser/font problem?

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re^2: My UTF-8 text isn't surviving I/O as expected
by ibm1620 (Hermit) on Nov 26, 2024 at 01:56 UTC
    Reading Tom Christiansen's sobering post about Unicode was enough to discourage me from trying to become proficient with it. I'm retired, so I get to do that. :-)

      On the surface, yes, it looks bad. But from my experience, you can cover nearly all cases (like 99.5% or so) by following some simple rules, no matter the encoding:

      • Convert all incoming data to Perl's internal representation (Encode::decode or similar).
      • Convert all outgoing data to the correct encoding (Encode::encode to UTF-8 or similar).
      • Unless you really have to verify very specific things in the text, just treat it like an opaque binary blob.
      • 0 + $var works for converting text to numeric values.
      • If you do any type of string comparison in your code, always normalize both sides using Unicode::Normalize, and always stick to the same normalization form.
      • Don't assume that any other text-encoding standard is saner. Or even a global standard.
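A minimal sketch of the first two rules (decode at the input boundary, encode at the output boundary), using the core Encode module; the file path in the commented PerlIO example is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Incoming data arrives as raw octets; here, "Käse" encoded as UTF-8.
my $octets = "K\xC3\xA4se";            # 5 bytes on the wire
my $text   = decode('UTF-8', $octets); # Perl's internal character string

print length($octets), "\n";  # 5 (bytes)
print length($text),   "\n";  # 4 (characters)

# On the way out, encode back to octets exactly once, at the boundary.
my $out = encode('UTF-8', $text);
print length($out), "\n";     # 5 again

# For filehandles, PerlIO layers can do the decode/encode for you
# ('some_input.txt' is a hypothetical path):
# open my $fh, '<:encoding(UTF-8)', 'some_input.txt' or die $!;
# binmode STDOUT, ':encoding(UTF-8)';
```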

      The basic ugliness of Unicode (or any other text encoding) stems not from its engineers but from the basic fact that human language is a complicated mess. Written language is still a relatively new concept in human evolution, and we are still figuring out the finer details. At least with Unicode, you don't have to constantly switch schemes depending on who is using your software.

