http://www.perlmonks.org?node_id=1039058

Jim has asked for the wisdom of the Perl Monks concerning the following question:

I have this character encoding damaged text. It's gibberish, not Chinese.

    敒›剕䕇呎

    U+6552  CJK UNIFIED IDEOGRAPH-6552
    U+203A  SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    U+5255  CJK UNIFIED IDEOGRAPH-5255
    U+4547  CJK UNIFIED IDEOGRAPH-4547
    U+544E  CJK UNIFIED IDEOGRAPH-544E

I know this is the original, undamaged text.

    Re: URGENT

I've determined how the damage occurred. The original ten characters were ASCII (UTF-8), but they were mistakenly interpreted as UCS-2LE. Then they were petrified as five bogus characters (mojibake) in Unicode (UTF-8). This is essentially like what happens in the case of the infamous Bush hid the facts bug in Microsoft Notepad.

Here's the pattern.

    R   e   :       U   R   G   E   N   T
    52  65  3A  20  55  52  47  45  4E  54
    U+6552  U+203A  U+5255  U+4547  U+544E
    敒      ›       剕      䕇      呎

How can I reverse this character encoding damage using Perl? I tried using Encode::Repair, but I couldn't get it to work. It seems to me this repair job should be easily accomplished using pack/unpack, but those two functions have always confounded me. I need guidance.

UPDATE:  Here's what I've managed to cobble together. It works, but I'm not impressed. Surely there's a better way.

use v5.16;
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

my $damaged_text = '敒›剕䕇呎';

my $repaired_text = '';

while ($damaged_text =~ m/(\X)/g) {
    my ($msb, $lsb) = unpack 'A2A2', sprintf "%04x", ord $1;
    $repaired_text .= chr(hex $lsb) . chr(hex $msb);
}

say $repaired_text; # Prints 'Re: URGENT'

(I had to use <pre> tags instead of <code> tags here because of the Chinese characters in the script.)

  • Comment on How to Fix Character Encoding Damaged Text Using Perl?

Replies are listed 'Best First'.
Re: How to Fix Character Encoding Damaged Text Using Perl?
by Your Mother (Archbishop) on Jun 15, 2013 at 05:51 UTC

    If it's really that straightforward, this kind of reverse pipe-line might do–

    perl -CSD -MEncode -le '$moji = decode "UCS-2LE", "Re: URGENT"; print $moji; print decode "UTF-8", encode "UCS-2LE", $moji'
    敒›剕䕇呎
    Re: URGENT
    

      Thank you very much, Your Mother!

      This works wonderfully, and it makes sense given the known cause of the encoding corruption.

      use v5.16;
      use utf8;
      use open qw( :encoding(UTF-8) :std );
      use Encode qw( encode decode );
      
      while (my $damaged_text = <DATA>) {
          chomp $damaged_text;
      
          my $repaired_text = decode('UTF-8', encode('UCS-2LE', $damaged_text));
      
          say $repaired_text;
      }
      
      close DATA;
      
      exit 0;
      
      __DATA__
      敒›剕䕇呎
      敌馀⁳潧琠韦겜鯥↽
      

      This prints…

          Re: URGENT
          Let’s go to 日本国!
      

      …as expected.

      In this case, use utf8 is required.

Re: How to Fix Character Encoding Damaged Text Using Perl?
by moritz (Cardinal) on Jun 15, 2013 at 07:49 UTC

    Since you already know what sequence of encoding and decoding lead to the broken output, the easiest way with Encode::Repair is this:

    use 5.010;
    use strict;
    use warnings;
    
    use Encode::Repair qw(repair_encoding);
    my $broken = '敒›剕䕇呎';
    say repair_encoding($broken, [decode => 'utf-8', encode => 'UTF-16LE']);
    __END__
    # output:
    Re: URGENT
    

    But it also works with learn_recoding:

    use 5.010;
    use strict;
    use warnings;
    use Encode::Repair qw(repair_encoding learn_recoding);
    binmode STDOUT, ':encoding(UTF-8)';
    my $broken = '敒›剕䕇呎';
    
    my $pattern = learn_recoding(
        from    => $broken,
        to      => 'Re: URGENT',
        encodings => ['UTF-8', 'UTF-16LE', 'UTF-16BE'],
    );
    
    if ($pattern) {
        say repair_encoding($broken, $pattern);
    }
    

    So, what did you try?

    (Updated to use pre tags instead of code, because code tags badly break most non-ASCII-chars.

      Thank you very much, moritz, for your helpful reply. I greatly appreciate it.

      So, what did you try?

      I tried variations of something very similar to your second example using Encode::Repair::learn_recoding(). In hindsight, I believe the problem that thwarted my efforts was my inclusion of use utf8 in the script. Needless to say, I thought I was doing the right thing when, in fact, I was doing exactly the wrong thing.

      I just tested using Encode::Repair to repair damaged text that has non-ASCII characters in it.

          敌馀⁳潧琠韦겜鯥↽
          \u654C\uE274\u9980\u2073\u6F67\u7420\u206F\u97E6\uE6A5\uAC9C\u9BE5\u21BD
          \x4C\x65\x74\xE2\x80\x99\x73\x20\x67\x6F\x20\x74\x6F\x20\xE6\x97\xA5\xE6\x9C\xAC\xE5\x9B\xBD\x21
          \u004C\u0065\u0074\u2019\u0073\u0020\u0067\u006F\u0020\u0074\u006F\u0020\u65E5\u672C\u56FD\u0021
          Let’s go to 日本国!
      

      It works! Here's the script.

      use v5.16;
      
      use Encode::Repair qw( repair_encoding );
      
      while (my $damaged_text = <DATA>) {
          chomp $damaged_text;
      
          my $repaired_text = repair_encoding(
              $damaged_text, [
                  decode => 'UTF-8',
                  encode => 'UCS-2LE',
              ]
          );
      
          say $repaired_text;
      }
      
      exit 0;
      
      __DATA__
      敒›剕䕇呎
      敌馀⁳潧琠韦겜鯥↽
      

      Brilliantly, this prints…

          Re: URGENT
          Let’s go to 日本国!
      

      Notice, however, that I had to remove binmode STDOUT, ':encoding(UTF-8)'. This version of the script also works.

      use v5.16;
      
      use Encode qw( decode );
      use Encode::Repair qw( repair_encoding );
      
      binmode STDOUT, ':encoding(UTF-8)';
      
      while (my $damaged_text = <DATA>) {
          chomp $damaged_text;
      
          my $repaired_text = repair_encoding(
              $damaged_text, [
                  decode => 'UTF-8',
                  encode => 'UCS-2LE',
              ]
          );
      
          say decode('UTF-8', $repaired_text);
      }
      
      exit 0;
      
      __DATA__
      敒›剕䕇呎
      敌馀⁳潧琠韦겜鯥↽
      

      I can study and study and study the Perl documentation, but I'll never grok the subtleties of Perl's Unicode model. It's simply too profoundly confusing for me.

      I love your module, moritz! Thanks for it, and for your help here.

Re: How to Fix Character Encoding Damaged Text Using Perl?
by BrowserUk (Patriarch) on Jun 15, 2013 at 06:56 UTC

    You often keep such text -- without a newline -- in a file on its own?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I work in digital forensics and electronic discovery. In the real world, sh*t happens. I often have to wrestle with damaged evidence of all kinds. Ignoring it is not an option. But I can't just ask the bad guys to recreate the incriminating evidence for me—this time, please, without all the nasty character encoding damaged text in it.

      So, yes, I often keep such text without a newline in a file on its own.

Re: How to Fix Character Encoding Damaged Text Using Perl?
by Anonymous Monk on Jun 15, 2013 at 02:29 UTC

    It's gibberish, not Chinese.

    Then post the BYTES not unicode codepoints

      I posted the Unicode code points and character names to help illustrate how the character encoding damage occurred; that is, to demonstrate the pattern.

      My problem is straightforward:  Using Perl, restore the damaged text '敒›剕䕇呎' to the original text 'Re: URGENT'.

      Whoops :)
      my $bytes = encode( 'UTF-8', $perlunicodestring ); $bytes =~ s{\x52\x65\x3A\x20\x55\x52\x47\x45\x4E\x54}{Re: URGENT}g