Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Re^3: Malformed UTF-8 character

by soonix (Chancellor)
on Dec 02, 2022 at 09:04 UTC ( [id://11148493]=note: print w/replies, xml ) Need Help??


in reply to Re^2: Malformed UTF-8 character
in thread Malformed UTF-8 character

if you use utf8 only for strings and not for variable names, you could convert your special characters to \N{...} notation, such als "\N{EN DASH}"

Of course, it's your decision whether "Bj\N{LATIN SMALL LETTER O WITH DIAERESIS}rk" is more readable than something like "Bj�rk" or not ;-)

N.B.: For the \N escape to work in Perl older than 5.16, you need an explicit use charnames;

Replies are listed 'Best First'.
Re^4: Malformed UTF-8 character
by pryrt (Abbot) on Dec 02, 2022 at 15:34 UTC
    soonix wrote the following, in reply to BillKSmith,
    > if you use utf8 only for strings.... it's your decision

    Actually, 1nickt wrote the code that BillKSmith was trying to run. It was not a decision on BillKSmith's part at all; he just downloaded code from perlmonks, expecting code from a longstanding monk to run without edits. And because perlmonks sends ISO-8859-1 encoding, not UTF-8, then code that is served as ISO-8859-1 will Save As a file encoded with ISO-8859-1. And then because there was a use utf8; in the code that perlmonks serves as ISO-8859-1, the perl executable gives the "Malformed UTF-8 character" message because of the mismatch between the file encoding and the pragma.

    The best would be if perlmonks would serve posts and [download]s as UTF-8, or at least give us an option for it to do so. The next best is for the monk who [download]s the code to convert the file (whether by iconv or a perl oneliner¤ or by a text editor that can change a file's encoding) before running. The suggestion that requires the most effort so far would be for the monk who [download]s the code from perlmonks to have to search through every piece of code they download from perlmonks that has use utf8; and check to make sure that the code isn't actually relying on it, and either commenting out that pragma if it's not actually needed (as I hinted at earlier) or changing every non-ASCII character in a quote from the actual character to a named character.

    ¤: oneliner = perl -pi -MEncode=encode,decode -e "$_ = encode('utf-8', decode('iso-8859-1', $_));" save-as.pl

Re^4: Malformed UTF-8 character
by BillKSmith (Monsignor) on Dec 02, 2022 at 17:08 UTC
    I use the \N{} notation frequently. At the time that I opened this thread, I did not know what unicode character the \x96 was meant to represent.
    Bill

      G'day Bill,

      "I did not know what unicode character the \x96 was meant to represent."

      A quick way to determine this is via "Unicode Character Code Charts" — it has "Find chart by hex code:" near the top of the page.

      [Aside: Although that's a standard URL, I noted, when checking it, that it has: "Unicode 15.0 Character Code Charts". I thought that I'd just mention that Perl does a pretty good job of supporting the latest Unicode versions. Perl v5.36.0 (released in May this year) supports Unicode 14.0 (the current version at the time); if you're desperate for 15.0 support, it was added in v5.37.5 (or just wait for 5.38.0 to be released in May next year, or thereabouts).]

      That will give you the name, <control>, and the informative alias, START OF GUARDED AREA; you can use the latter in \N{}.

      $ perl -E 'say sprintf "%x", ord("\N{START OF GUARDED AREA}")' 96

      In a script or one-liner, you can use Unicode::UCD, but it's not always straightforward. Compare:

      $ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x34)->{name}' DIGIT FOUR $ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x34)->{unicode10} || +"<blank>"' <blank> $ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x96)->{name}' <control> $ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x96)->{unicode10} || +"<blank>"' START OF GUARDED AREA

      — Ken

        My problem was that the \x96 was not the Unicode code-point, or even the utf8 encoding of the character. I now know that it is the cp1252 encoding of \N{EN DASH}. I had forgotten that there is such a thing as cp1252!
        Bill

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11148493]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (2)
As of 2025-07-13 11:26 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?
    erzuuliAnonymous Monks are no longer allowed to use Super Search, due to an excessive use of this resource by robots.