How to Fix Character Encoding Damaged Text Using Perl?

Jim has asked for the wisdom of the Perl Monks concerning the following question:

I have this character encoding damaged text. It's gibberish, not Chinese.

    敒›剕䕇呎

    U+6552  CJK UNIFIED IDEOGRAPH-6552
    U+203A  SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    U+5255  CJK UNIFIED IDEOGRAPH-5255
    U+4547  CJK UNIFIED IDEOGRAPH-4547
    U+544E  CJK UNIFIED IDEOGRAPH-544E

I know this is the original, undamaged text.

    Re: URGENT

I've determined how the damage occurred. The original ten characters were ASCII (and therefore also valid UTF-8), but their bytes were mistakenly interpreted as UCS-2LE, so each pair of bytes became one 16-bit character. The five resulting bogus characters (mojibake) were then petrified in Unicode (UTF-8). This is essentially what happens in the case of the infamous "Bush hid the facts" bug in Microsoft Notepad.

Here's the pattern.

    R   e   :       U   R   G   E   N   T
    52  65  3A  20  55  52  47  45  4E  54
    U+6552  U+203A  U+5255  U+4547  U+544E
    敒      ›       剕      䕇      呎
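
The pattern above can be reproduced in a few lines. This is only a minimal sketch using the core Encode module; the helper name `damage` is made up for illustration and is not part of any library.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: reproduce the damage by reading the original
# bytes as if they were UCS-2LE (two bytes per character, low byte first).
sub damage {
    my ($bytes) = @_;
    return decode('UCS-2LE', $bytes);
}

my $mojibake = damage('Re: URGENT');

# Each bogus code point is (second byte << 8) | (first byte) of a pair,
# e.g. 'R' (0x52) and 'e' (0x65) become U+6552.
printf "U+%04X\n", ord($_) for split //, $mojibake;
```

Run as is, this should list the same five code points shown in the pattern: U+6552, U+203A, U+5255, U+4547, U+544E.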

How can I reverse this character encoding damage using Perl? I tried using Encode::Repair, but I couldn't get it to work. It seems to me this repair job should be easily accomplished using pack/unpack, but those two functions have always confounded me. I need guidance.

UPDATE: Here's what I've managed to cobble together. It works, but I'm not impressed. Surely there's a better way.

    use v5.16;
    use strict;
    use warnings;
    use utf8;

    binmode STDOUT, ':encoding(UTF-8)';

    my $damaged_text = '敒›剕䕇呎';

    my $repaired_text = '';

    while ($damaged_text =~ m/(\X)/g) {
        my ($msb, $lsb) = unpack 'A2A2', sprintf "%04x", ord $1;
        $repaired_text .= chr(hex $lsb) . chr(hex $msb);
    }

    say $repaired_text; # Prints 'Re: URGENT'

(I had to use <pre> tags instead of <code> tags here because of the Chinese characters in the script.)
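
Since the question mentions pack/unpack, here is one way the same repair might look with `pack`. Treat it as a sketch under two assumptions: every damaged character lies in the Basic Multilingual Plane (code point no larger than 0xFFFF), and the original byte count was even. The helper name `repair_with_pack` is hypothetical. The input is written with `\x{...}` escapes so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: undo "UTF-8 bytes misread as UCS-2LE".
# Assumes every damaged character has a code point <= 0xFFFF.
sub repair_with_pack {
    my ($damaged) = @_;

    # 'v*' packs each code point as an unsigned 16-bit little-endian
    # value, which puts the original bytes back in their original order.
    my $bytes = pack 'v*', map { ord($_) } split //, $damaged;

    # The recovered bytes are UTF-8, so decode them into characters.
    return decode('UTF-8', $bytes);
}

binmode STDOUT, ':encoding(UTF-8)';
print repair_with_pack("\x{6552}\x{203A}\x{5255}\x{4547}\x{544E}"), "\n";
```

The `pack 'v*'` step replaces the `sprintf`/`hex`/`chr` juggling in the loop above, and the final `decode` is what lets originals containing non-ASCII characters survive.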

Re: How to Fix Character Encoding Damaged Text Using Perl? by Your Mother (Archbishop) on Jun 15, 2013 at 05:51 UTC
If it's really that straightforward, this kind of reverse pipeline might do:

    perl -CSD -MEncode -le '$moji = decode "UCS-2LE", "Re: URGENT"; print $moji; print decode "UTF-8", encode "UCS-2LE", $moji'
    敒›剕䕇呎
    Re: URGENT
Re^2: How to Fix Character Encoding Damaged Text Using Perl? by Jim (Curate) on Jun 15, 2013 at 20:48 UTC
Thank you very much, Your Mother! This works wonderfully, and it makes sense given the known cause of the encoding corruption.

    use v5.16;
    use utf8;
    use open qw( :encoding(UTF-8) :std );
    use Encode qw( encode decode );

    while (my $damaged_text = <DATA>) {
        chomp $damaged_text;
        my $repaired_text = decode('UTF-8', encode('UCS-2LE', $damaged_text));
        say $repaired_text;
    }

    close DATA;

    exit 0;

    __DATA__
    敒›剕䕇呎
    敌馀⁳潧琠⁯韦겜鯥↽

This prints…

    Re: URGENT
    Let’s go to 日本国!

…as expected. In this case, `use utf8` is required.
Re: How to Fix Character Encoding Damaged Text Using Perl? by moritz (Cardinal) on Jun 15, 2013 at 07:49 UTC
Since you already know what sequence of encoding and decoding led to the broken output, the easiest way with Encode::Repair is this:

    use 5.010;
    use strict;
    use warnings;
    use Encode::Repair qw(repair_encoding);

    my $broken = '敒›剕䕇呎';
    say repair_encoding($broken, [decode => 'utf-8', encode => 'UTF-16LE']);

    __END__
    # output:
    Re: URGENT

But it also works with learn_recoding:

    use 5.010;
    use strict;
    use warnings;
    use Encode::Repair qw(repair_encoding learn_recoding);
    binmode STDOUT, ':encoding(UTF-8)';

    my $broken = '敒›剕䕇呎';
    my $pattern = learn_recoding(
        from      => $broken,
        to        => 'Re: URGENT',
        encodings => ['UTF-8', 'UTF-16LE', 'UTF-16BE'],
    );

    if ($pattern) {
        say repair_encoding($broken, $pattern);
    }

So, what did you try?

(Updated to use pre tags instead of code, because code tags badly break most non-ASCII chars.)

Perl 6 - the future is here, just unevenly distributed
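
The idea of learning a recoding from one known good/bad pair can be sketched with core Encode alone. To be clear, this is a hypothetical brute-force search and not how Encode::Repair is implemented. The helper name `guess_recoding` is invented for illustration, and unlike the module it works on character strings rather than bytes.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper (not Encode::Repair's implementation): find a pair
# ($misread_as, $really_was) such that encoding the damaged characters as
# $misread_as and decoding those bytes as $really_was gives the known text.
sub guess_recoding {
    my ($damaged, $expected, @encodings) = @_;

    # FB_CROAK makes invalid conversions die instead of silently
    # substituting characters; LEAVE_SRC keeps the inputs unmodified.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

    for my $misread_as (@encodings) {
        for my $really_was (@encodings) {
            next if $misread_as eq $really_was;
            my $candidate = eval {
                my $bytes = encode($misread_as, $damaged, $check);
                decode($really_was, $bytes, $check);
            };
            next unless defined $candidate;
            return ($misread_as, $really_was) if $candidate eq $expected;
        }
    }
    return;    # no pair of encodings explains the damage
}

my @pair = guess_recoding(
    "\x{6552}\x{203A}\x{5255}\x{4547}\x{544E}",
    'Re: URGENT',
    'UTF-8', 'UTF-16LE', 'UTF-16BE',
);
print "misread as $pair[0], really was $pair[1]\n" if @pair;
```

For the sample in this thread, the search should settle on UTF-16LE as the wrong interpretation and UTF-8 as the real encoding. With a longer list of candidate encodings, more than one pair may match, so a longer known sample makes the guess more trustworthy.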
Re^2: How to Fix Character Encoding Damaged Text Using Perl? by Jim (Curate) on Jun 15, 2013 at 18:37 UTC
Thank you very much, moritz, for your helpful reply. I greatly appreciate it.

> So, what did you try?

I tried variations of something very similar to your second example using `Encode::Repair::learn_recoding()`. In hindsight, I believe the problem that thwarted my efforts was my inclusion of `use utf8` in the script. Needless to say, I thought I was doing the right thing when, in fact, I was doing exactly the wrong thing.

I just tested using Encode::Repair to repair damaged text that has non-ASCII characters in it.

    敌馀⁳潧琠⁯韦겜鯥↽
    \u654C\uE274\u9980\u2073\u6F67\u7420\u206F\u97E6\uE6A5\uAC9C\u9BE5\u21BD
    \x4C\x65\x74\xE2\x80\x99\x73\x20\x67\x6F\x20\x74\x6F\x20\xE6\x97\xA5\xE6\x9C\xAC\xE5\x9B\xBD\x21
    \u004C\u0065\u0074\u2019\u0073\u0020\u0067\u006F\u0020\u0074\u006F\u0020\u65E5\u672C\u56FD\u0021
    Let’s go to 日本国!

It works! Here's the script.

    use v5.16;
    use Encode::Repair qw( repair_encoding );

    while (my $damaged_text = <DATA>) {
        chomp $damaged_text;
        my $repaired_text = repair_encoding(
            $damaged_text,
            [ decode => 'UTF-8', encode => 'UCS-2LE', ]
        );
        say $repaired_text;
    }

    exit 0;

    __DATA__
    敒›剕䕇呎
    敌馀⁳潧琠⁯韦겜鯥↽

Brilliantly, this prints…

    Re: URGENT
    Let’s go to 日本国!

Notice, however, that I had to remove `binmode STDOUT, ':encoding(UTF-8)'`. This version of the script also works.

    use v5.16;
    use Encode qw( decode );
    use Encode::Repair qw( repair_encoding );

    binmode STDOUT, ':encoding(UTF-8)';

    while (my $damaged_text = <DATA>) {
        chomp $damaged_text;
        my $repaired_text = repair_encoding(
            $damaged_text,
            [ decode => 'UTF-8', encode => 'UCS-2LE', ]
        );
        say decode('UTF-8', $repaired_text);
    }

    exit 0;

    __DATA__
    敒›剕䕇呎
    敌馀⁳潧琠⁯韦겜鯥↽

I can study and study and study the Perl documentation, but I'll never grok the subtleties of Perl's Unicode model. It's simply too profoundly confusing for me.

I love your module, moritz! Thanks for it, and for your help here.
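
The bytes-versus-characters distinction behind the `use utf8` and `binmode` observations can be illustrated with core Encode. This sketch assumes, based only on the behaviour described in this thread, that the `[decode => 'UTF-8', encode => 'UCS-2LE']` action list takes bytes in and hands bytes back. The helper `repair_bytes` is hypothetical and merely spells that pipeline out by hand.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The damaged text as characters (what a string literal holds under `use utf8`).
my $chars = "\x{6552}\x{203A}\x{5255}\x{4547}\x{544E}";

# The same text as UTF-8 bytes (what the literal holds without `use utf8`).
my $bytes = encode('UTF-8', $chars);

printf "%d characters, %d bytes\n", length($chars), length($bytes);

# Hypothetical helper: a bytes-in, bytes-out repair written out with Encode.
sub repair_bytes {
    my ($damaged_bytes) = @_;
    my $damaged_chars = decode('UTF-8', $damaged_bytes);   # bytes -> characters
    return encode('UCS-2LE', $damaged_chars);              # characters -> bytes
}

# The result is still bytes (the original UTF-8), so it needs one more
# decode before it is printed through an :encoding(UTF-8) layer.
my $repaired_bytes = repair_bytes($bytes);
my $repaired_chars = decode('UTF-8', $repaired_bytes);

binmode STDOUT, ':encoding(UTF-8)';
print "$repaired_chars\n";
```

If that assumption holds, it would explain both working scripts: printing the returned bytes raw works without an encoding layer on STDOUT, while adding the layer requires a final `decode('UTF-8', ...)`.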
Re: How to Fix Character Encoding Damaged Text Using Perl? by BrowserUk (Patriarch) on Jun 15, 2013 at 06:56 UTC
You often keep such text -- without a newline -- in a file on its own?

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: How to Fix Character Encoding Damaged Text Using Perl? by Jim (Curate) on Jun 15, 2013 at 21:01 UTC
I work in digital forensics and electronic discovery. In the real world, sh*t happens. I often have to wrestle with damaged evidence of all kinds. Ignoring it is not an option. But I can't just ask the bad guys to recreate the incriminating evidence for me, this time, please, without all the nasty character encoding damaged text in it. So, yes, I often keep such text without a newline in a file on its own.
Re: How to Fix Character Encoding Damaged Text Using Perl? by Anonymous Monk on Jun 15, 2013 at 02:29 UTC
> It's gibberish, not Chinese.

Then post the BYTES, not Unicode code points.
Re^2: How to Fix Character Encoding Damaged Text Using Perl? by Jim (Curate) on Jun 15, 2013 at 02:51 UTC
I posted the Unicode code points and character names to help illustrate how the character encoding damage occurred; that is, to demonstrate the pattern. My problem is straightforward: Using Perl, restore the damaged text '敒›剕䕇呎' to the original text 'Re: URGENT'.
Re^2: How to Fix Character Encoding Damaged Text Using Perl? by Anonymous Monk on Jun 15, 2013 at 02:32 UTC
Whoops :)

    my $bytes = encode( 'UTF-8', $perlunicodestring );
    $bytes =~ s{\x52\x65\x3A\x20\x55\x52\x47\x45\x4E\x54}{Re: URGENT}g;

