Re^3: Malformed UTF-8 character

Re^4: Malformed UTF-8 character by pryrt (Abbot) on Dec 02, 2022 at 15:34 UTC
soonix wrote the following, in reply to BillKSmith:

> if you `use utf8` only for strings.... it's your decision

Actually, 1nickt wrote the code that BillKSmith was trying to run. It was not a decision on BillKSmith's part at all; he just downloaded code from perlmonks, expecting code from a longstanding monk to run without edits.

Because perlmonks serves pages as ISO-8859-1, not UTF-8, code downloaded from it will Save As a file encoded in ISO-8859-1. And because that code contains `use utf8;`, the perl executable gives the "Malformed UTF-8 character" message: the file's actual encoding does not match the encoding the pragma promises.

The best would be if perlmonks would serve posts and `[download]`s as UTF-8, or at least give us an option for it to do so. The next best is for the monk who `[download]`s the code to convert the file (whether by iconv or a perl oneliner¤ or by a text editor that can change a file's encoding) before running. The suggestion that requires the most effort so far would be for the monk to search through every piece of code they download from perlmonks that has `use utf8;`, check whether the code actually relies on it, and then either comment out the pragma if it's not needed (as I hinted at earlier) or change every non-ASCII character in a quoted string from the literal character to a named character.

¤: oneliner = `perl -pi -MEncode=encode,decode -e "$_ = encode('utf-8', decode('iso-8859-1', $_));" save-as.pl`
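The mismatch described above can be reproduced without downloading anything. What follows is a minimal sketch, not the code from the thread: it assumes the saved file contains a single 0x96 byte inside a string literal (the byte Bill reports further down), and the helper names `is_valid_utf8` and `latin1_to_utf8` are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Return 1 if the byte string is well-formed UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK argument may modify its input
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# Re-encode ISO-8859-1 bytes as UTF-8, the same conversion as the oneliner.
sub latin1_to_utf8 {
    my ($bytes) = @_;
    return encode('UTF-8', decode('iso-8859-1', $bytes));
}

# Assumed file content: a string literal holding a lone 0x96 byte.
my $saved = qq{print "a \x96 b";\n};
my $fixed = latin1_to_utf8($saved);

printf "saved file is valid UTF-8?     %s\n", is_valid_utf8($saved) ? 'yes' : 'no';
printf "converted file is valid UTF-8? %s\n", is_valid_utf8($fixed) ? 'yes' : 'no';
```

A lone 0x96 is a UTF-8 continuation byte with no lead byte before it, which is why a file containing it cannot be read as UTF-8 until it has been converted.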
Re^4: Malformed UTF-8 character by BillKSmith (Monsignor) on Dec 02, 2022 at 17:08 UTC
I use the \N{} notation frequently. At the time that I opened this thread, I did not know what unicode character the \x96 was meant to represent.

Bill
Re^5: Malformed UTF-8 character by kcott (Archbishop) on Dec 03, 2022 at 04:45 UTC
G'day Bill,

"I did not know what unicode character the \x96 was meant to represent."

A quick way to determine this is via "Unicode Character Code Charts": it has "Find chart by hex code:" near the top of the page.

[Aside: Although that's a standard URL, I noted, when checking it, that it has: "Unicode 15.0 Character Code Charts". I thought that I'd just mention that Perl does a pretty good job of supporting the latest Unicode versions. Perl v5.36.0 (released in May this year) supports Unicode 14.0 (the current version at the time); if you're desperate for 15.0 support, it was added in v5.37.5 (or just wait for 5.38.0 to be released in May next year, or thereabouts).]

That will give you the name, `<control>`, and the informative alias, `START OF GUARDED AREA`; you can use the latter in `\N{}`.

`$ perl -E 'say sprintf "%x", ord("\N{START OF GUARDED AREA}")'`
`96`

In a script or one-liner, you can use Unicode::UCD, but it's not always straightforward. Compare:

`$ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x34)->{name}'`
`DIGIT FOUR`
`$ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x34)->{unicode10} || "<blank>"'`
`<blank>`
`$ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x96)->{name}'`
`<control>`
`$ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x96)->{unicode10} || "<blank>"'`
`START OF GUARDED AREA`

-- Ken
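The two lookups compared above could be folded into one helper. This is only a sketch under the assumption that a placeholder name such as `<control>` should fall back to the `unicode10` field; the function name `describe_codepoint` is hypothetical and not part of Unicode::UCD.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Best available name for a code point: the formal name, unless that is a
# placeholder such as "<control>", in which case use the Unicode 1.0 name.
sub describe_codepoint {
    my ($cp) = @_;
    my $info = charinfo($cp) or return '<unassigned>';
    my $name = $info->{name};
    return $name if length $name && $name !~ /^</;
    return $info->{unicode10} || $name;
}

printf "U+%04X %s\n", $_, describe_codepoint($_) for 0x34, 0x96, 0x2013;
```

The fallback only helps where the `unicode10` field is populated, as it is for U+0096 in the output shown above.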
Re^6: Malformed UTF-8 character by BillKSmith (Monsignor) on Dec 03, 2022 at 13:39 UTC
My problem was that the \x96 was not the Unicode code-point, or even the utf8 encoding of the character. I now know that it is the cp1252 encoding of \N{EN DASH}. I had forgotten that there is such a thing as cp1252!

Bill
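Bill's finding can be checked with the Encode module. The sketch below compares how the single byte 0x96 decodes under cp1252 and under ISO-8859-1; the helper name `codepoint_of` is invented for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode one byte under the given encoding and return the resulting code point.
sub codepoint_of {
    my ($encoding, $byte) = @_;
    return ord decode($encoding, $byte);
}

printf "cp1252:     U+%04X\n", codepoint_of('cp1252',     "\x96");
printf "iso-8859-1: U+%04X\n", codepoint_of('iso-8859-1', "\x96");

# For comparison, the bytes that encode EN DASH in UTF-8.
my $utf8 = encode('UTF-8', "\N{EN DASH}");
printf "UTF-8 bytes of EN DASH: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $utf8;
```

Under cp1252 the byte maps to U+2013 (EN DASH), while under ISO-8859-1 it maps to the control character U+0096. So if a downloaded file really holds cp1252 bytes, as it apparently did here, decoding it as cp1252 rather than iso-8859-1 is probably what recovers the intended character.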
