Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Re^3: Malformed UTF-8 character

by soonix (Canon)
on Dec 02, 2022 at 09:04 UTC ( [id://11148493]=note: print w/replies, xml ) Need Help??


in reply to Re^2: Malformed UTF-8 character
in thread Malformed UTF-8 character

if you use utf8 only for strings and not for variable names, you could convert your special characters to \N{...} notation, such als "\N{EN DASH}"

Of course, it's your decision whether "Bj\N{LATIN SMALL LETTER O WITH DIAERESIS}rk" is more readable than something like "Bj�rk" or not ;-)

N.B.: For the \N escape to work in Perl older than 5.16, you need an explicit use charnames;

Replies are listed 'Best First'.
Re^4: Malformed UTF-8 character
by pryrt (Abbot) on Dec 02, 2022 at 15:34 UTC
    soonix wrote the following, in reply to BillKSmith,
    > if you use utf8 only for strings.... it's your decision

    Actually, 1nickt wrote the code that BillKSmith was trying to run. It was not a decision on BillKSmith's part at all; he just downloaded code from perlmonks, expecting code from a longstanding monk to run without edits. And because perlmonks sends ISO-8859-1 encoding, not UTF-8, then code that is served as ISO-8859-1 will Save As a file encoded with ISO-8859-1. And then because there was a use utf8; in the code that perlmonks serves as ISO-8859-1, the perl executable gives the "Malformed UTF-8 character" message because of the mismatch between the file encoding and the pragma.

    The best would be if perlmonks would serve posts and [download]s as UTF-8, or at least give us an option for it to do so. The next best is for the monk who [download]s the code to convert the file (whether by iconv or a perl oneliner¤ or by a text editor that can change a file's encoding) before running. The suggestion that requires the most effort so far would be for the monk who [download]s the code from perlmonks to have to search through every piece of code they download from perlmonks that has use utf8; and check to make sure that the code isn't actually relying on it, and either commenting out that pragma if it's not actually needed (as I hinted at earlier) or changing every non-ASCII character in a quote from the actual character to a named character.

    ¤: oneliner = perl -pi -MEncode=encode,decode -e "$_ = encode('utf-8', decode('iso-8859-1', $_));" save-as.pl

Re^4: Malformed UTF-8 character
by BillKSmith (Monsignor) on Dec 02, 2022 at 17:08 UTC
    I use the \N{} notation frequently. At the time that I opened this thread, I did not know what unicode character the \x96 was meant to represent.
    Bill

      G'day Bill,

      "I did not know what unicode character the \x96 was meant to represent."

      A quick way to determine this is via "Unicode Character Code Charts" — it has "Find chart by hex code:" near the top of the page.

      [Aside: Although that's a standard URL, I noted, when checking it, that it has: "Unicode 15.0 Character Code Charts". I thought that I'd just mention that Perl does a pretty good job of supporting the latest Unicode versions. Perl v5.36.0 (released in May this year) supports Unicode 14.0 (the current version at the time); if you're desperate for 15.0 support, it was added in v5.37.5 (or just wait for 5.38.0 to be released in May next year, or thereabouts).]

      That will give you the name, <control>, and the informative alias, START OF GUARDED AREA; you can use the latter in \N{}.

      $ perl -E 'say sprintf "%x", ord("\N{START OF GUARDED AREA}")' 96

      In a script or one-liner, you can use Unicode::UCD, but it's not always straightforward. Compare:

      $ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x34)->{name}' DIGIT FOUR $ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x34)->{unicode10} || +"<blank>"' <blank> $ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x96)->{name}' <control> $ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x96)->{unicode10} || +"<blank>"' START OF GUARDED AREA

      — Ken

        My problem was that the \x96 was not the Unicode code-point, or even the utf8 encoding of the character. I now know that it is the cp1252 encoding of \N{EN DASH}. I had forgotten that there is such a thing as cp1252!
        Bill

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11148493]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others contemplating the Monastery: (5)
As of 2024-04-19 13:23 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found