Jim has asked for the wisdom of the Perl Monks concerning the following question:
I have this text that was damaged by a character encoding mix-up. It's gibberish, not Chinese.
敒›剕䕇呎

U+6552 CJK UNIFIED IDEOGRAPH-6552
U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
U+5255 CJK UNIFIED IDEOGRAPH-5255
U+4547 CJK UNIFIED IDEOGRAPH-4547
U+544E CJK UNIFIED IDEOGRAPH-544E
I know this is the original, undamaged text.
Re: URGENT
I've determined how the damage occurred. The original ten characters were ASCII (and therefore also valid UTF-8), one byte each, but they were mistakenly interpreted as UCS-2LE, which reads each pair of bytes as a single 16-bit little-endian code unit. The five bogus characters (mojibake) that resulted were then petrified in Unicode (UTF-8). This is essentially what happens in the infamous "Bush hid the facts" bug in Microsoft Notepad.
Here's the pattern.
R    e    :         U    R    G    E    N    T
52   65   3A   20   55   52   47   45   4E   54
U+6552    U+203A    U+5255    U+4547    U+544E
敒        ›         剕        䕇        呎
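For what it's worth, the damage itself can be reproduced in a few lines, which at least confirms the pattern. This is only a sketch, assuming the core Encode module and its 'UCS-2LE' encoding name; the variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(decode);

binmode STDOUT, ':encoding(UTF-8)';

# The original ten ASCII bytes.
my $bytes = 'Re: URGENT';

# Misread the bytes as UCS-2LE: every two bytes become one
# 16-bit little-endian code unit, so ten bytes yield five characters.
my $damaged = decode('UCS-2LE', $bytes);

# List the resulting code points, then the mojibake itself.
printf "U+%04X\n", ord($_) for split //, $damaged;
print "$damaged\n";
```

The code points listed should be the same five shown in the pattern above.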
How can I reverse this character encoding damage using Perl? I tried using Encode::Repair, but I couldn't get it to work. It seems to me this repair job should be easily accomplished using pack/unpack, but those two functions have always confounded me. I need guidance.
UPDATE: Here's what I've managed to cobble together. It works, but I'm not impressed. Surely there's a better way.
use v5.16;
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

my $damaged_text = '敒›剕䕇呎';
my $repaired_text = '';

while ($damaged_text =~ m/(\X)/g) {
    my ($msb, $lsb) = unpack 'A2A2', sprintf "%04x", ord $1;
    $repaired_text .= chr(hex $lsb) . chr(hex $msb);
}

say $repaired_text; # Prints 'Re: URGENT'
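Two shorter routes that might do the same job, offered as a sketch rather than a verified answer: re-encode the damaged characters as UCS-2LE with the core Encode module and decode the resulting bytes as UTF-8, or pack each code point as a 16-bit little-endian value with pack's 'v' template. Both assume every damaged character is below U+10000, which holds for this sample. The damaged string is written with \x{...} escapes so the script does not depend on the source file's encoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The five mojibake characters, written as escapes.
my $damaged_text = "\x{6552}\x{203A}\x{5255}\x{4547}\x{544E}";

# Route 1: undo the misreading with Encode.
# Encoding as UCS-2LE recovers the original byte sequence;
# decoding those bytes as UTF-8 gives back the ASCII text.
my $bytes         = encode('UCS-2LE', $damaged_text);
my $repaired_text = decode('UTF-8', $bytes);

# Route 2: pack each code point as a 16-bit little-endian
# value ('v'), which emits the low byte first, then the high byte.
my $repaired_packed = pack 'v*', map { ord } split //, $damaged_text;

print "$repaired_text\n";
print "$repaired_packed\n";
```

Route 2 yields a byte string, so for non-ASCII originals it would still need a decode step; for this all-ASCII sample the two results are identical.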
(I had to use <pre> tags instead of <code> tags here because of the Chinese characters in the script.)