PerlMonks  

Re^4: Malformed UTF-8 character

by BillKSmith (Monsignor)
on Dec 02, 2022 at 17:08 UTC ( [id://11148507]=note )


in reply to Re^3: Malformed UTF-8 character
in thread Malformed UTF-8 character

I use the \N{} notation frequently. At the time that I opened this thread, I did not know what unicode character the \x96 was meant to represent.
Bill

Re^5: Malformed UTF-8 character
by kcott (Archbishop) on Dec 03, 2022 at 04:45 UTC

    G'day Bill,

    "I did not know what unicode character the \x96 was meant to represent."

    A quick way to determine this is via "Unicode Character Code Charts" — it has "Find chart by hex code:" near the top of the page.

    [Aside: Although that's a standard URL, I noted, when checking it, that it has: "Unicode 15.0 Character Code Charts". I thought that I'd just mention that Perl does a pretty good job of supporting the latest Unicode versions. Perl v5.36.0 (released in May this year) supports Unicode 14.0 (the current version at the time); if you're desperate for 15.0 support, it was added in v5.37.5 (or just wait for 5.38.0 to be released in May next year, or thereabouts).]

    That will give you the name, <control>, and the informative alias, START OF GUARDED AREA; you can use the latter in \N{}.

    $ perl -E 'say sprintf "%x", ord("\N{START OF GUARDED AREA}")'
    96

    In a script or one-liner, you can use Unicode::UCD, but it's not always straightforward. Compare:

    $ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x34)->{name}'
    DIGIT FOUR
    $ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x34)->{unicode10} || +"<blank>"'
    <blank>
    $ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x96)->{name}'
    <control>
    $ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x96)->{unicode10} || +"<blank>"'
    START OF GUARDED AREA

    — Ken

      My problem was that the \x96 was neither the Unicode code point nor the UTF-8 encoding of the character. I now know that it is the cp1252 encoding of \N{EN DASH}. I had forgotten that there is such a thing as cp1252!
      Bill
