http://www.perlmonks.org?node_id=1034961


in reply to Re: Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?
in thread Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?

Greetings, and thank you for your reply.
This is nearly the same output I received when running the Perl script I posted.
The script merely indicated that Unicode::UCD couldn't properly map "\x99" (0099) | "&#8482;" (in decimal)
to a Unicode symbol/entity. In all likelihood, that was because the document wasn't properly encoded
(windows-1252|ISO-8859-1 instead of UTF-8|UTF8). I've examined enough of the documents
to know that they aren't "junk", but rather UTF-8 encoded files that weren't saved accordingly.
So, knowing that Perl is quite Unicode|UTF-8 savvy, I was hoping I could find
a way to let Perl discover the document's current (incorrect) encoding -- say, ISO-8859-1 -- and either
convert the embedded symbols to their Decimal equivalent, or, if it's safe, to save it as UTF-8.
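
A minimal sketch of that idea, assuming the only two candidates are UTF-8 and windows-1252. The helper names decode_guess and to_decimal_refs are made up for illustration, and the "try strict UTF-8 first, then fall back" test is only a heuristic, not real encoding detection:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: try a strict UTF-8 decode first; if the bytes are
# not valid UTF-8, assume (and it is only an assumption) windows-1252.
sub decode_guess {
    my ($bytes) = @_;
    my $text = eval {
        # FB_CROAK dies on malformed input; LEAVE_SRC keeps $bytes intact.
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return ($text, 'UTF-8') if defined $text;
    return (decode('cp1252', $bytes), 'cp1252');
}

# Replace every non-ASCII character with its decimal numeric reference.
sub to_decimal_refs {
    my ($text) = @_;
    (my $out = $text) =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $out;
}

my ($text, $guess) = decode_guess("Acme\x99 Widgets");
print "guessed encoding: $guess\n";
print to_decimal_refs($text), "\n";    # prints "Acme&#8482; Widgets"

# Alternatively, re-encode the decoded text as UTF-8 bytes for saving.
my $utf8_bytes = encode('UTF-8', $text);
```

Pure ASCII input passes the strict UTF-8 test as well, which is harmless since ASCII is byte-identical in both encodings.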
In fact, saving that same document as UTF-8 and running the script on it caused
the script to emit that error. Reading that same document, with the embedded symbols/characters in it, as
ISO-8859-1, the script emitted:
6 U+0009 GC=Cc CHARACTER TABULATION
2564 U+000A GC=Cc LINE FEED (LF)
25209 U+0020 GC=Zs SPACE
8436 U+0021 GC=Po EXCLAMATION MARK
167 U+0022 GC=Po QUOTATION MARK
35 U+0023 GC=Po NUMBER SIGN
7 U+0024 GC=Sc DOLLAR SIGN
1140 U+0025 GC=Po PERCENT SIGN
46 U+0026 GC=Po AMPERSAND
108 U+0027 GC=Po APOSTROPHE
134 U+0028 GC=Ps LEFT PARENTHESIS
134 U+0029 GC=Pe RIGHT PARENTHESIS
14 U+002A GC=Po ASTERISK
2751 U+002C GC=Po COMMA
439 U+002D GC=Pd HYPHEN-MINUS
1655 U+002E GC=Po FULL STOP
518 U+002F GC=Po SOLIDUS
73 U+0030 GC=Nd DIGIT ZERO
91 U+0031 GC=Nd DIGIT ONE
107 U+0032 GC=Nd DIGIT TWO
53 U+0033 GC=Nd DIGIT THREE
30 U+0034 GC=Nd DIGIT FOUR
49 U+0035 GC=Nd DIGIT FIVE
13 U+0036 GC=Nd DIGIT SIX
5 U+0037 GC=Nd DIGIT SEVEN
21 U+0038 GC=Nd DIGIT EIGHT
12 U+0039 GC=Nd DIGIT NINE
331 U+003A GC=Po COLON
43 U+003B GC=Po SEMICOLON
714 U+003C GC=Sm LESS-THAN SIGN
2176 U+003D GC=Sm EQUALS SIGN
2853 U+003E GC=Sm GREATER-THAN SIGN
103 U+003F GC=Po QUESTION MARK
4 U+0040 GC=Po COMMERCIAL AT
665 U+0041 GC=Lu LATIN CAPITAL LETTER A
547 U+0042 GC=Lu LATIN CAPITAL LETTER B
370 U+0043 GC=Lu LATIN CAPITAL LETTER C
331 U+0044 GC=Lu LATIN CAPITAL LETTER D
625 U+0045 GC=Lu LATIN CAPITAL LETTER E
323 U+0046 GC=Lu LATIN CAPITAL LETTER F
104 U+0047 GC=Lu LATIN CAPITAL LETTER G
171 U+0048 GC=Lu LATIN CAPITAL LETTER H
509 U+0049 GC=Lu LATIN CAPITAL LETTER I
32 U+004A GC=Lu LATIN CAPITAL LETTER J
83 U+004B GC=Lu LATIN CAPITAL LETTER K
378 U+004C GC=Lu LATIN CAPITAL LETTER L
594 U+004D GC=Lu LATIN CAPITAL LETTER M
520 U+004E GC=Lu LATIN CAPITAL LETTER N
410 U+004F GC=Lu LATIN CAPITAL LETTER O
653 U+0050 GC=Lu LATIN CAPITAL LETTER P
39 U+0051 GC=Lu LATIN CAPITAL LETTER Q
623 U+0052 GC=Lu LATIN CAPITAL LETTER R
564 U+0053 GC=Lu LATIN CAPITAL LETTER S
912 U+0054 GC=Lu LATIN CAPITAL LETTER T
486 U+0055 GC=Lu LATIN CAPITAL LETTER U
89 U+0056 GC=Lu LATIN CAPITAL LETTER V
196 U+0057 GC=Lu LATIN CAPITAL LETTER W
8 U+0058 GC=Lu LATIN CAPITAL LETTER X
394 U+0059 GC=Lu LATIN CAPITAL LETTER Y
4 U+005A GC=Lu LATIN CAPITAL LETTER Z
21 U+005B GC=Ps LEFT SQUARE BRACKET
21 U+005D GC=Pe RIGHT SQUARE BRACKET
5 U+005E GC=Sk CIRCUMFLEX ACCENT
4766 U+005F GC=Pc LOW LINE
10143 U+0061 GC=Ll LATIN SMALL LETTER A
2570 U+0062 GC=Ll LATIN SMALL LETTER B
4103 U+0063 GC=Ll LATIN SMALL LETTER C
4907 U+0064 GC=Ll LATIN SMALL LETTER D
16937 U+0065 GC=Ll LATIN SMALL LETTER E
2591 U+0066 GC=Ll LATIN SMALL LETTER F
2564 U+0067 GC=Ll LATIN SMALL LETTER G
3859 U+0068 GC=Ll LATIN SMALL LETTER H
9548 U+0069 GC=Ll LATIN SMALL LETTER I
87 U+006A GC=Ll LATIN SMALL LETTER J
502 U+006B GC=Ll LATIN SMALL LETTER K
6444 U+006C GC=Ll LATIN SMALL LETTER L
4640 U+006D GC=Ll LATIN SMALL LETTER M
7574 U+006E GC=Ll LATIN SMALL LETTER N
10936 U+006F GC=Ll LATIN SMALL LETTER O
4417 U+0070 GC=Ll LATIN SMALL LETTER P
4481 U+0071 GC=Ll LATIN SMALL LETTER Q
10310 U+0072 GC=Ll LATIN SMALL LETTER R
10046 U+0073 GC=Ll LATIN SMALL LETTER S
11385 U+0074 GC=Ll LATIN SMALL LETTER T
4523 U+0075 GC=Ll LATIN SMALL LETTER U
1888 U+0076 GC=Ll LATIN SMALL LETTER V
1574 U+0077 GC=Ll LATIN SMALL LETTER W
537 U+0078 GC=Ll LATIN SMALL LETTER X
2773 U+0079 GC=Ll LATIN SMALL LETTER Y
80 U+007A GC=Ll LATIN SMALL LETTER Z
19 U+007B GC=Ps LEFT CURLY BRACKET
10 U+007C GC=Sm VERTICAL LINE
19 U+007D GC=Pe RIGHT CURLY BRACKET
207 U+007E GC=Sm TILDE
55 U+0099 GC=Cc <unnamed code point in Latin-1 Supplement>
3 U+00A0 GC=Zs NO-BREAK SPACE
(55 U+0099 GC=Cc <unnamed code point in Latin-1 Supplement>)
being the offending symbol/character.
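
For what it's worth, the U+0099 entry is what one would expect if byte 0x99 is read as ISO-8859-1, where it lands on a C1 control code point, whereas windows-1252 assigns that byte to the trademark sign (U+2122, decimal 8482). A small sketch to compare the two readings of the same byte; describe_byte is just an illustrative name, not part of any module:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# Illustrative helper: decode a single byte under the given encoding and
# report the resulting code point, its name (if any), and whether it
# falls in general category Cc (control).
sub describe_byte {
    my ($byte, $encoding) = @_;
    my $char = decode($encoding, $byte);
    my $cp   = ord $char;
    my $name = charnames::viacode($cp) // '<no name>';
    my $kind = $char =~ /\p{Cc}/ ? 'control' : 'not a control';
    return sprintf 'U+%04X %s (%s)', $cp, $name, $kind;
}

for my $encoding ('iso-8859-1', 'cp1252') {
    print "$encoding: ", describe_byte("\x99", $encoding), "\n";
}
```

The cp1252 line should report U+2122 TRADE MARK SIGN; the name shown for U+0099 may vary with the Perl version, since that code point has no official character name.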
Anyway, I see you've provided some other possibilities. So I'd probably do well to further investigate them.

Thanks again, for taking the time to respond.

--chris

#!/usr/bin/perl -Tw
use perl::always;
my $perl_version = "5.12.4";
print $perl_version;