PerlMonks  

Re: Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?

by Anonymous Monk
on May 23, 2013 at 06:40 UTC


in reply to Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?

Surely I'm not the only one that's had to overcome something like this.

No, you aren't :) I've been there, and there is no canned solution :) See also: disable warnings: utf8 "\x8E" does not map to Unicode (NOTE: malformed, corrupted, double encoded; Encoding::FixLatin / fix_latin / Encode::DoubleEncodedUTF8 / Encode::Repair / Encode::Detective / Encode::Guess; unicode, chcp, cp1252, windows-1252, cp437, iso-8859-1)

I did spend quite some time trying to find a solution reading all the perldoc's.

There wouldn't be a general solution -- maybe if it's been corrupted only once, but if it's been corrupted more than once, you've got junk with no way back.

Fixing broken character encoding
gibberish detection
Encode::Detective - detect a data encoding
Encoding::FixLatin - takes mixed encoding input and produces UTF-8 output
Encode::DoubleEncodedUTF8 - Fix double encoded UTF-8 bytes to the correct one
Encode::Repair - Repair wrongly encoded text strings
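
As a rough illustration of the "corrupted once" case, here is a minimal sketch that uses only the core Encode module rather than any of the modules listed above. The helper name undo_double_utf8 and the sample string are made up for this example. It assumes the damage was exactly one round of UTF-8 bytes being misread as ISO-8859-1 and then encoded to UTF-8 again; a misread as windows-1252 would need a different inner encoding.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: undo ONE round of "UTF-8 bytes misread as Latin-1,
# then encoded to UTF-8 again". The input is returned unchanged whenever
# the repaired result would not itself be valid UTF-8.
sub undo_double_utf8 {
    my ($bytes) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

    # Outer layer: the input must at least be valid UTF-8.
    my $chars = eval { decode('UTF-8', $bytes, $check) };
    return $bytes unless defined $chars;

    # Every character must fit in one byte, or Latin-1 cannot represent it.
    return $bytes if $chars =~ /[^\x00-\xFF]/;

    # Inner layer: the Latin-1 bytes must again be valid UTF-8.
    my $inner = encode('ISO-8859-1', $chars);
    my $ok = eval { decode('UTF-8', $inner, $check); 1 };
    return $ok ? $inner : $bytes;
}

my $good   = "caf\xC3\xA9";          # "cafe" with e-acute, as UTF-8 bytes
my $broken = "caf\xC3\x83\xC2\xA9";  # the same text, double encoded
print undo_double_utf8($broken) eq $good ? "repaired\n" : "not repaired\n";
```

Text that is already correct UTF-8 passes through untouched, because its Latin-1 re-encoding fails the inner check. Applying the helper repeatedly to data that was mangled several different ways is exactly where this stops being reliable.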


Re^2: Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?
by taint (Chaplain) on May 23, 2013 at 14:30 UTC
    Greetings, and thank you for your reply.
    This is nearly the same output I received when running the Perl script I posted.
    The script merely indicated that Unicode::UCD couldn't properly map "\x99" (0099) | "&#8482;" (in decimal)
    to a Unicode symbol/entity. In all likelihood, that was because the document was encoded as
    windows-1252|ISO-8859-1 instead of UTF-8|UTF8. I've examined enough of the documents
    to know that they aren't "junk", but rather UTF-8 encoded files that weren't saved accordingly.
    So, knowing that Perl is quite Unicode|UTF-8 savvy, I was hoping I could find
    a way to let Perl discover a document's current, incorrect encoding -- say ISO-8859-1 -- and either
    convert the embedded symbols to their decimal equivalents or, if it's safe, save it as UTF-8.
    In fact, saving that same document as UTF-8 and running that script on it caused
    the script to emit that error. Reading that same document, with the embedded symbols/characters in it, as
    ISO-8859-1 with that script emitted:
        6 U+0009 GC=Cc CHARACTER TABULATION
     2564 U+000A GC=Cc LINE FEED (LF)
    25209 U+0020 GC=Zs SPACE
     8436 U+0021 GC=Po EXCLAMATION MARK
      167 U+0022 GC=Po QUOTATION MARK
       35 U+0023 GC=Po NUMBER SIGN
        7 U+0024 GC=Sc DOLLAR SIGN
     1140 U+0025 GC=Po PERCENT SIGN
       46 U+0026 GC=Po AMPERSAND
      108 U+0027 GC=Po APOSTROPHE
      134 U+0028 GC=Ps LEFT PARENTHESIS
      134 U+0029 GC=Pe RIGHT PARENTHESIS
       14 U+002A GC=Po ASTERISK
     2751 U+002C GC=Po COMMA
      439 U+002D GC=Pd HYPHEN-MINUS
     1655 U+002E GC=Po FULL STOP
      518 U+002F GC=Po SOLIDUS
       73 U+0030 GC=Nd DIGIT ZERO
       91 U+0031 GC=Nd DIGIT ONE
      107 U+0032 GC=Nd DIGIT TWO
       53 U+0033 GC=Nd DIGIT THREE
       30 U+0034 GC=Nd DIGIT FOUR
       49 U+0035 GC=Nd DIGIT FIVE
       13 U+0036 GC=Nd DIGIT SIX
        5 U+0037 GC=Nd DIGIT SEVEN
       21 U+0038 GC=Nd DIGIT EIGHT
       12 U+0039 GC=Nd DIGIT NINE
      331 U+003A GC=Po COLON
       43 U+003B GC=Po SEMICOLON
      714 U+003C GC=Sm LESS-THAN SIGN
     2176 U+003D GC=Sm EQUALS SIGN
     2853 U+003E GC=Sm GREATER-THAN SIGN
      103 U+003F GC=Po QUESTION MARK
        4 U+0040 GC=Po COMMERCIAL AT
      665 U+0041 GC=Lu LATIN CAPITAL LETTER A
      547 U+0042 GC=Lu LATIN CAPITAL LETTER B
      370 U+0043 GC=Lu LATIN CAPITAL LETTER C
      331 U+0044 GC=Lu LATIN CAPITAL LETTER D
      625 U+0045 GC=Lu LATIN CAPITAL LETTER E
      323 U+0046 GC=Lu LATIN CAPITAL LETTER F
      104 U+0047 GC=Lu LATIN CAPITAL LETTER G
      171 U+0048 GC=Lu LATIN CAPITAL LETTER H
      509 U+0049 GC=Lu LATIN CAPITAL LETTER I
       32 U+004A GC=Lu LATIN CAPITAL LETTER J
       83 U+004B GC=Lu LATIN CAPITAL LETTER K
      378 U+004C GC=Lu LATIN CAPITAL LETTER L
      594 U+004D GC=Lu LATIN CAPITAL LETTER M
      520 U+004E GC=Lu LATIN CAPITAL LETTER N
      410 U+004F GC=Lu LATIN CAPITAL LETTER O
      653 U+0050 GC=Lu LATIN CAPITAL LETTER P
       39 U+0051 GC=Lu LATIN CAPITAL LETTER Q
      623 U+0052 GC=Lu LATIN CAPITAL LETTER R
      564 U+0053 GC=Lu LATIN CAPITAL LETTER S
      912 U+0054 GC=Lu LATIN CAPITAL LETTER T
      486 U+0055 GC=Lu LATIN CAPITAL LETTER U
       89 U+0056 GC=Lu LATIN CAPITAL LETTER V
      196 U+0057 GC=Lu LATIN CAPITAL LETTER W
        8 U+0058 GC=Lu LATIN CAPITAL LETTER X
      394 U+0059 GC=Lu LATIN CAPITAL LETTER Y
        4 U+005A GC=Lu LATIN CAPITAL LETTER Z
       21 U+005B GC=Ps LEFT SQUARE BRACKET
       21 U+005D GC=Pe RIGHT SQUARE BRACKET
        5 U+005E GC=Sk CIRCUMFLEX ACCENT
     4766 U+005F GC=Pc LOW LINE
    10143 U+0061 GC=Ll LATIN SMALL LETTER A
     2570 U+0062 GC=Ll LATIN SMALL LETTER B
     4103 U+0063 GC=Ll LATIN SMALL LETTER C
     4907 U+0064 GC=Ll LATIN SMALL LETTER D
    16937 U+0065 GC=Ll LATIN SMALL LETTER E
     2591 U+0066 GC=Ll LATIN SMALL LETTER F
     2564 U+0067 GC=Ll LATIN SMALL LETTER G
     3859 U+0068 GC=Ll LATIN SMALL LETTER H
     9548 U+0069 GC=Ll LATIN SMALL LETTER I
       87 U+006A GC=Ll LATIN SMALL LETTER J
      502 U+006B GC=Ll LATIN SMALL LETTER K
     6444 U+006C GC=Ll LATIN SMALL LETTER L
     4640 U+006D GC=Ll LATIN SMALL LETTER M
     7574 U+006E GC=Ll LATIN SMALL LETTER N
    10936 U+006F GC=Ll LATIN SMALL LETTER O
     4417 U+0070 GC=Ll LATIN SMALL LETTER P
     4481 U+0071 GC=Ll LATIN SMALL LETTER Q
    10310 U+0072 GC=Ll LATIN SMALL LETTER R
    10046 U+0073 GC=Ll LATIN SMALL LETTER S
    11385 U+0074 GC=Ll LATIN SMALL LETTER T
     4523 U+0075 GC=Ll LATIN SMALL LETTER U
     1888 U+0076 GC=Ll LATIN SMALL LETTER V
     1574 U+0077 GC=Ll LATIN SMALL LETTER W
      537 U+0078 GC=Ll LATIN SMALL LETTER X
     2773 U+0079 GC=Ll LATIN SMALL LETTER Y
       80 U+007A GC=Ll LATIN SMALL LETTER Z
       19 U+007B GC=Ps LEFT CURLY BRACKET
       10 U+007C GC=Sm VERTICAL LINE
       19 U+007D GC=Pe RIGHT CURLY BRACKET
      207 U+007E GC=Sm TILDE
       55 U+0099 GC=Cc <unnamed code point in Latin-1 Supplement>
        3 U+00A0 GC=Zs NO-BREAK SPACE
    (55 U+0099 GC=Cc <unnamed code point in Latin-1 Supplement>)
    being the offending symbol/character.
    Anyway, I see you've provided some other possibilities. So I'd probably do well to further investigate them.
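
The two options mentioned above (decimal entities, or saving as UTF-8) can be sketched with the core Encode module. This is only an illustration under one assumption: that the byte 0x99 came from a windows-1252 source, where it stands for the trade mark sign U+2122 (decimal 8482). The sample string is made up, and nothing here detects the encoding automatically.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $raw = "Acme\x99 widgets";   # hypothetical bytes as stored in a file

# Read as ISO-8859-1, byte 0x99 is the unnamed C1 control U+0099.
my $as_latin1 = decode('ISO-8859-1', $raw);
# Read as windows-1252, the same byte is U+2122 TRADE MARK SIGN.
my $as_cp1252 = decode('cp1252', $raw);

printf "latin1: U+%04X\n", ord(substr($as_latin1, 4, 1));
printf "cp1252: U+%04X\n", ord(substr($as_cp1252, 4, 1));

# Option 1: replace every non-ASCII character with a decimal entity.
(my $entities = $as_cp1252) =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
print "$entities\n";            # Acme&#8482; widgets

# Option 2: produce UTF-8 bytes, ready to be written to a file opened
# in raw mode.
my $utf8 = encode('UTF-8', $as_cp1252);
```

Decoding as ISO-8859-1 never fails, since every byte maps to some code point, which is why the report shows U+0099 rather than an error. Choosing cp1252 instead is a judgment about where the file came from, not something the bytes themselves prove.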

    Thanks again, for taking the time to respond.

    --chris

    #!/usr/bin/perl -Tw
    use perl::always;
    my $perl_version = "5.12.4";
    print $perl_version;
