soonix wrote the following, in reply to BillKSmith,
> if you use utf8 only for strings.... it's your decision
Actually, 1nickt wrote the code that BillKSmith was trying to run. It was not a decision on BillKSmith's part at all; he just downloaded code from perlmonks, expecting code from a longstanding monk to run without edits. Because perlmonks serves its pages as ISO-8859-1, not UTF-8, code downloaded from the site is saved as an ISO-8859-1 file. That code contained use utf8;, which tells perl the source file is UTF-8, so the perl executable gives the "Malformed UTF-8 character" message because of the mismatch between the file encoding and the pragma.
The best would be if perlmonks would serve posts and [download]s as UTF-8, or at least give us an option for it to do so. The next best is for the monk who [download]s the code to convert the file (whether by iconv, a perl oneliner¤, or a text editor that can change a file's encoding) before running. The suggestion that requires the most effort so far is for the monk who [download]s the code to search through every piece of code from perlmonks that has use utf8;, check whether the code actually relies on it, and then either comment out that pragma if it's not needed (as I hinted at earlier) or change every non-ASCII character in a quoted string from the literal character to a named character (\N{...}).
¤: oneliner = perl -pi -MEncode=encode,decode -e "$_ = encode('utf-8', decode('iso-8859-1', $_));" save-as.pl
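If you would rather not re-encode a file that is already UTF-8, the same conversion can be guarded by a validity check. This is only a minimal sketch: the helper name to_utf8_bytes is made up for illustration, and treating every non-UTF-8 input as ISO-8859-1 is an assumption carried over from the oneliner above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: return the given bytes as UTF-8.
# Bytes that already form valid UTF-8 are returned unchanged;
# anything else is ASSUMED to be ISO-8859-1 and is re-encoded.
sub to_utf8_bytes {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    my $already_utf8 = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
    return $bytes if $already_utf8;
    return encode('UTF-8', decode('iso-8859-1', $bytes));
}

# "caf\xE9" is 'cafe' with an e-acute stored as one ISO-8859-1 byte.
my $converted = to_utf8_bytes("caf\xE9");
print join(' ', map { sprintf '%02X', ord } split //, $converted), "\n";
# prints: 63 61 66 C3 A9
```

To use it on a downloaded file, read and write the file in raw (binary) mode so that no I/O layer changes the bytes on the way through.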
I use the \N{} notation frequently. At the time that I opened this thread, I did not know what unicode character the \x96 was meant to represent.
G'day Bill,
"I did not know what unicode character the \x96 was meant to represent."
A quick way to determine this is via "Unicode Character Code Charts" —
it has "Find chart by hex code:" near the top of the page.
[Aside:
Although that's a standard URL, I noted, when checking it, that it has: "Unicode 15.0 Character Code Charts".
I thought that I'd just mention that Perl does a pretty good job of supporting the latest Unicode versions.
Perl v5.36.0 (released in May this year) supports Unicode 14.0 (the current version at the time);
if you're desperate for 15.0 support, it was added in
v5.37.5
(or just wait for 5.38.0 to be released in May next year, or thereabouts).]
That will give you the name, <control>, and the informative alias, START OF GUARDED AREA;
you can use the latter in \N{}.
$ perl -E 'say sprintf "%x", ord("\N{START OF GUARDED AREA}")'
96
In a script or one-liner, you can use Unicode::UCD, but it's not always straightforward.
Compare:
$ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x34)->{name}'
DIGIT FOUR
$ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x34)->{unicode10} || "<blank>"'
<blank>
$ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x96)->{name}'
<control>
$ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x96)->{unicode10} || "<blank>"'
START OF GUARDED AREA
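In a script, the two lookups above could be folded into one helper that falls back to the unicode10 field whenever the name is only <control>. This is a rough sketch, not anything Unicode::UCD provides itself: describe_codepoint and the '<unassigned>' label are names I've made up for illustration.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Hypothetical helper: the most descriptive label available for a code point.
sub describe_codepoint {
    my ($cp) = @_;
    my $info = charinfo($cp)
        or return '<unassigned>';    # charinfo() returns undef if unassigned
    my $name = $info->{name};
    return $name unless $name eq '<control>';
    # Control characters have no real name; use the Unicode 1.0 name if any.
    return $info->{unicode10} || $name;
}

printf "U+%04X %s\n", $_, describe_codepoint($_) for 0x34, 0x96;
# prints:
# U+0034 DIGIT FOUR
# U+0096 START OF GUARDED AREA
```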
My problem was that the \x96 was neither the Unicode code point nor the UTF-8 encoding of the character. I now know that it is the cp1252 encoding of \N{EN DASH}. I had forgotten that there is such a thing as cp1252!
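For anyone who wants to check this for themselves, here is a small sketch using the core Encode and charnames modules. It decodes the same byte under both encodings; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();    # load charnames::viacode without enabling \N{} aliases

my $byte = "\x96";

# The same byte means different characters in the two encodings.
my $as_cp1252 = decode('cp1252',     $byte);
my $as_latin1 = decode('iso-8859-1', $byte);

printf "cp1252:     U+%04X %s\n", ord($as_cp1252),
    charnames::viacode(ord $as_cp1252);
printf "iso-8859-1: U+%04X\n", ord($as_latin1);
# prints:
# cp1252:     U+2013 EN DASH
# iso-8859-1: U+0096
```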