PerlMonks  

Re: Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?

by Anonymous Monk
on May 23, 2013 at 06:40 UTC ( #1034881=note )


in reply to Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?

Surely I'm not the only one that's had to overcome something like this.

No, you aren't :) I've been there, and there is no canned solution :) See: disable warnings: utf8 "\x8E" does not map to Unicode (NOTE: malformed, corrupted, double encoded; Encoding::FixLatin / fix_latin / Encode::DoubleEncodedUTF8 / Encode::Repair / Encode::Detective / Encode::Guess; unicode, chcp, cp1252, windows-1252, cp437, iso-8859-1)

I did spend quite some time trying to find a solution reading all the perldoc's.

There wouldn't be a solution. Maybe if it's been corrupted once, but corrupt it more than once and you've got junk with no way back.

Fixing broken character encoding
gibberish detection
Encode::Detective - detect a data encoding
Encoding::FixLatin - takes mixed encoding input and produces UTF-8 output
Encode::DoubleEncodedUTF8 - Fix double encoded UTF-8 bytes to the correct one
Encode::Repair - Repair wrongly encoded text strings
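
For the simple case (data mangled once at most), a minimal sketch using only the core Encode module might look like the following. It tries a strict UTF-8 decode first and falls back to a single-byte encoding only if that fails. The helper name to_utf8 and the choice of cp1252 as the fallback are assumptions for illustration; this cannot tell cp1252 apart from ISO-8859-1 or a Mac encoding, which is what the detection modules listed above attempt.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: return UTF-8 bytes for input that is either
# already valid UTF-8 or (by assumption) Windows-1252.
sub to_utf8 {
    my ($bytes) = @_;

    # Decode a copy: with a CHECK argument, decode() may modify its source.
    my $copy = $bytes;

    # FB_CROAK makes decode() die on malformed UTF-8 instead of
    # silently substituting replacement characters.
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };

    # Fallback is a guess, not detection: treat the bytes as cp1252.
    $text = decode('cp1252', $bytes) unless defined $text;

    return encode('UTF-8', $text);
}
```

Input that is already valid UTF-8 passes through unchanged, so running this twice over the same file should not double-encode it.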


Re^2: Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?
by taint (Chaplain) on May 23, 2013 at 14:30 UTC
    Greetings, and thank you for your reply.
    This is nearly the same output I received running the Perl script I posted.
    The script merely indicated that Unicode::UCD couldn't properly map "\x99" (0099) | "&#8482;" (in Decimal)
    to a Unicode symbol/entity. In all likelihood, it was because the document wasn't properly encoded
    (windows-1252|ISO-8859-1 instead of UTF-8|UTF8). I've examined enough of the documents
    to know that they aren't "junk", but rather UTF-8 encoded files that weren't saved accordingly.
    So, knowing that Perl is quite Unicode|UTF-8 savvy, I was hoping I could find
    a way to let Perl discover its current incorrect encoding -- say ISO-8859-1 -- and either
    convert the embedded symbols to their Decimal equivalent, or, if it's safe, save it as UTF-8.
    In fact, saving that same document as UTF-8 and running that script on it caused
    the script to emit that error. Reading that same document, with the embedded symbols/characters in it,
    as ISO-8859-1 with that script emitted:
    6 U+0009 GC=Cc CHARACTER TABULATION
    2564 U+000A GC=Cc LINE FEED (LF)
    25209 U+0020 GC=Zs SPACE
    8436 U+0021 GC=Po EXCLAMATION MARK
    167 U+0022 GC=Po QUOTATION MARK
    35 U+0023 GC=Po NUMBER SIGN
    7 U+0024 GC=Sc DOLLAR SIGN
    1140 U+0025 GC=Po PERCENT SIGN
    46 U+0026 GC=Po AMPERSAND
    108 U+0027 GC=Po APOSTROPHE
    134 U+0028 GC=Ps LEFT PARENTHESIS
    134 U+0029 GC=Pe RIGHT PARENTHESIS
    14 U+002A GC=Po ASTERISK
    2751 U+002C GC=Po COMMA
    439 U+002D GC=Pd HYPHEN-MINUS
    1655 U+002E GC=Po FULL STOP
    518 U+002F GC=Po SOLIDUS
    73 U+0030 GC=Nd DIGIT ZERO
    91 U+0031 GC=Nd DIGIT ONE
    107 U+0032 GC=Nd DIGIT TWO
    53 U+0033 GC=Nd DIGIT THREE
    30 U+0034 GC=Nd DIGIT FOUR
    49 U+0035 GC=Nd DIGIT FIVE
    13 U+0036 GC=Nd DIGIT SIX
    5 U+0037 GC=Nd DIGIT SEVEN
    21 U+0038 GC=Nd DIGIT EIGHT
    12 U+0039 GC=Nd DIGIT NINE
    331 U+003A GC=Po COLON
    43 U+003B GC=Po SEMICOLON
    714 U+003C GC=Sm LESS-THAN SIGN
    2176 U+003D GC=Sm EQUALS SIGN
    2853 U+003E GC=Sm GREATER-THAN SIGN
    103 U+003F GC=Po QUESTION MARK
    4 U+0040 GC=Po COMMERCIAL AT
    665 U+0041 GC=Lu LATIN CAPITAL LETTER A
    547 U+0042 GC=Lu LATIN CAPITAL LETTER B
    370 U+0043 GC=Lu LATIN CAPITAL LETTER C
    331 U+0044 GC=Lu LATIN CAPITAL LETTER D
    625 U+0045 GC=Lu LATIN CAPITAL LETTER E
    323 U+0046 GC=Lu LATIN CAPITAL LETTER F
    104 U+0047 GC=Lu LATIN CAPITAL LETTER G
    171 U+0048 GC=Lu LATIN CAPITAL LETTER H
    509 U+0049 GC=Lu LATIN CAPITAL LETTER I
    32 U+004A GC=Lu LATIN CAPITAL LETTER J
    83 U+004B GC=Lu LATIN CAPITAL LETTER K
    378 U+004C GC=Lu LATIN CAPITAL LETTER L
    594 U+004D GC=Lu LATIN CAPITAL LETTER M
    520 U+004E GC=Lu LATIN CAPITAL LETTER N
    410 U+004F GC=Lu LATIN CAPITAL LETTER O
    653 U+0050 GC=Lu LATIN CAPITAL LETTER P
    39 U+0051 GC=Lu LATIN CAPITAL LETTER Q
    623 U+0052 GC=Lu LATIN CAPITAL LETTER R
    564 U+0053 GC=Lu LATIN CAPITAL LETTER S
    912 U+0054 GC=Lu LATIN CAPITAL LETTER T
    486 U+0055 GC=Lu LATIN CAPITAL LETTER U
    89 U+0056 GC=Lu LATIN CAPITAL LETTER V
    196 U+0057 GC=Lu LATIN CAPITAL LETTER W
    8 U+0058 GC=Lu LATIN CAPITAL LETTER X
    394 U+0059 GC=Lu LATIN CAPITAL LETTER Y
    4 U+005A GC=Lu LATIN CAPITAL LETTER Z
    21 U+005B GC=Ps LEFT SQUARE BRACKET
    21 U+005D GC=Pe RIGHT SQUARE BRACKET
    5 U+005E GC=Sk CIRCUMFLEX ACCENT
    4766 U+005F GC=Pc LOW LINE
    10143 U+0061 GC=Ll LATIN SMALL LETTER A
    2570 U+0062 GC=Ll LATIN SMALL LETTER B
    4103 U+0063 GC=Ll LATIN SMALL LETTER C
    4907 U+0064 GC=Ll LATIN SMALL LETTER D
    16937 U+0065 GC=Ll LATIN SMALL LETTER E
    2591 U+0066 GC=Ll LATIN SMALL LETTER F
    2564 U+0067 GC=Ll LATIN SMALL LETTER G
    3859 U+0068 GC=Ll LATIN SMALL LETTER H
    9548 U+0069 GC=Ll LATIN SMALL LETTER I
    87 U+006A GC=Ll LATIN SMALL LETTER J
    502 U+006B GC=Ll LATIN SMALL LETTER K
    6444 U+006C GC=Ll LATIN SMALL LETTER L
    4640 U+006D GC=Ll LATIN SMALL LETTER M
    7574 U+006E GC=Ll LATIN SMALL LETTER N
    10936 U+006F GC=Ll LATIN SMALL LETTER O
    4417 U+0070 GC=Ll LATIN SMALL LETTER P
    4481 U+0071 GC=Ll LATIN SMALL LETTER Q
    10310 U+0072 GC=Ll LATIN SMALL LETTER R
    10046 U+0073 GC=Ll LATIN SMALL LETTER S
    11385 U+0074 GC=Ll LATIN SMALL LETTER T
    4523 U+0075 GC=Ll LATIN SMALL LETTER U
    1888 U+0076 GC=Ll LATIN SMALL LETTER V
    1574 U+0077 GC=Ll LATIN SMALL LETTER W
    537 U+0078 GC=Ll LATIN SMALL LETTER X
    2773 U+0079 GC=Ll LATIN SMALL LETTER Y
    80 U+007A GC=Ll LATIN SMALL LETTER Z
    19 U+007B GC=Ps LEFT CURLY BRACKET
    10 U+007C GC=Sm VERTICAL LINE
    19 U+007D GC=Pe RIGHT CURLY BRACKET
    207 U+007E GC=Sm TILDE
    55 U+0099 GC=Cc <unnamed code point in Latin-1 Supplement>
    3 U+00A0 GC=Zs NO-BREAK SPACE
    with the line
    (55 U+0099 GC=Cc <unnamed code point in Latin-1 Supplement>)
    being the offending symbol/character.
    Anyway, I see you've provided some other possibilities. So I'd probably do well to further investigate them.
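
    For reference, the behaviour of that one byte can be reproduced in isolation. The sketch below assumes the files are really Windows-1252 (a guess, not something the script can confirm): ISO-8859-1 maps byte 0x99 to the control character U+0099, while cp1252 maps it to U+2122, whose decimal value is 8482. The helper to_decimal_entities is a hypothetical name for the "convert to Decimal" idea, not an existing library function.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

my $byte = "\x99";

# ISO-8859-1 maps every byte to the code point with the same value.
my $latin1 = decode('iso-8859-1', $byte);

# cp1252 assigns printable characters to most of the 0x80-0x9F range.
my $cp1252 = decode('cp1252', $byte);

# Hypothetical helper: replace each non-ASCII character with a
# decimal numeric character reference such as &#8482;.
sub to_decimal_entities {
    my ($text) = @_;
    (my $out = $text) =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $out;
}

printf "latin1: U+%04X\n", ord $latin1;
printf "cp1252: U+%04X %s\n", ord $cp1252, charnames::viacode(ord $cp1252);
print to_decimal_entities($cp1252), "\n";
```

    If the cp1252 assumption holds, the 55 occurrences of U+0099 in the report above would come out as TRADE MARK SIGN rather than an unnamed control character.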

    Thanks again, for taking the time to respond.

    --chris

    #!/usr/bin/perl -Tw
    use perl::always;
    my $perl_version = "5.12.4";
    print $perl_version;
