Beefy Boxes and Bandwidth Generously Provided by pair Networks
Come for the quick hacks, stay for the epiphanies.
 
PerlMonks  

How to Fix Character Encoding Damaged Text Using Perl?

by Jim (Curate)
on Jun 15, 2013 at 01:52 UTC ( [id://1039058]=perlquestion: print w/replies, xml ) Need Help??

Jim has asked for the wisdom of the Perl Monks concerning the following question:

I have this character encoding damaged text. It's gibberish, not Chinese.

    敒›剕䕇呎

    U+6552  CJK UNIFIED IDEOGRAPH-6552
    U+203A  SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    U+5255  CJK UNIFIED IDEOGRAPH-5255
    U+4547  CJK UNIFIED IDEOGRAPH-4547
    U+544E  CJK UNIFIED IDEOGRAPH-544E

I know this is the original, undamaged text.

    Re: URGENT

I've determined how the damage occurred. The original ten characters were ASCII (UTF-8), but they were mistakenly interpreted as UCS-2LE. Then they were petrified as five bogus characters (mojibake) in Unicode (UTF-8). This is essentially like what happens in the case of the infamous Bush hid the facts bug in Microsoft Notepad.

Here's the pattern.

    R   e   :       U   R   G   E   N   T
    52  65  3A  20  55  52  47  45  4E  54
    U+6552  U+203A  U+5255  U+4547  U+544E
    敒      ›       剕      䕇      呎

How can I reverse this character encoding damage using Perl? I tried using Encode::Repair, but I couldn't get it to work. It seems to me this repair job should be easily accomplished using pack/unpack, but those two functions have always confounded me. I need guidance.

UPDATE:  Here's what I've managed to cobble together. It works, but I'm not impressed. Surely there's a better way.

use v5.16;
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

my $damaged_text = '敒›剕䕇呎';

my $repaired_text = '';

while ($damaged_text =~ m/(\X)/g) {
    my ($msb, $lsb) = unpack 'A2A2', sprintf "%04x", ord $1;
    $repaired_text .= chr(hex $lsb) . chr(hex $msb);
}

say $repaired_text; # Prints 'Re: URGENT'

(I had to use <pre> tags instead of <code> tags here because of the Chinese characters in the script.)

Replies are listed 'Best First'.
Re: How to Fix Character Encoding Damaged Text Using Perl?
by Your Mother (Archbishop) on Jun 15, 2013 at 05:51 UTC

    If it's really that straightforward, this kind of reverse pipe-line might do–

    perl -CSD -MEncode -le '$moji = decode "UCS-2LE", "Re: URGENT"; print $moji; print decode "UTF-8", encode "UCS-2LE", $moji'
    敒›剕䕇呎
    Re: URGENT
    

      Thank you very much, Your Mother!

      This works wonderfully, and it makes sense given the known cause of the encoding corruption.

      use v5.16;
      use utf8;
      use open qw( :encoding(UTF-8) :std );
      use Encode qw( encode decode );
      
      while (my $damaged_text = <DATA>) {
          chomp $damaged_text;
      
          my $repaired_text = decode('UTF-8', encode('UCS-2LE', $damaged_text));
      
          say $repaired_text;
      }
      
      close DATA;
      
      exit 0;
      
      __DATA__
      敒›剕䕇呎
      敌馀⁳潧琠韦겜鯥↽
      

      This prints…

          Re: URGENT
          Let’s go to 日本国!
      

      …as expected.

      In this case, use utf8 is required.

Re: How to Fix Character Encoding Damaged Text Using Perl?
by moritz (Cardinal) on Jun 15, 2013 at 07:49 UTC

    Since you already know what sequence of encoding and decoding lead to the broken output, the easiest way with Encode::Repair is this:

    use 5.010;
    use strict;
    use warnings;
    
    use Encode::Repair qw(repair_encoding);
    my $broken = '敒›剕䕇呎';
    say repair_encoding($broken, [decode => 'utf-8', encode => 'UTF-16LE']);
    __END__
    # output:
    Re: URGENT
    

    But it also works with learn_recoding:

    use 5.010;
    use strict;
    use warnings;
    use Encode::Repair qw(repair_encoding learn_recoding);
    binmode STDOUT, ':encoding(UTF-8)';
    my $broken = '敒›剕䕇呎';
    
    my $pattern = learn_recoding(
        from    => $broken,
        to      => 'Re: URGENT',
        encodings => ['UTF-8', 'UTF-16LE', 'UTF-16BE'],
    );
    
    if ($pattern) {
        say repair_encoding($broken, $pattern);
    }
    

    So, what did you try?

    (Updated to use pre tags instead of code, because code tags badly break most non-ASCII-chars.

      Thank you very much, moritz, for your helpful reply. I greatly appreciate it.

      So, what did you try?

      I tried variations of something very similar to your second example using Encode::Repair::learn_recoding(). In hindsight, I believe the problem that thwarted my efforts was my inclusion of use utf8 in the script. Needless to say, I thought I was doing the right thing when, in fact, I was doing exactly the wrong thing.

      I just tested using Encode::Repair to repair damaged text that has non-ASCII characters in it.

          敌馀⁳潧琠韦겜鯥↽
          \u654C\uE274\u9980\u2073\u6F67\u7420\u206F\u97E6\uE6A5\uAC9C\u9BE5\u21BD
          \x4C\x65\x74\xE2\x80\x99\x73\x20\x67\x6F\x20\x74\x6F\x20\xE6\x97\xA5\xE6\x9C\xAC\xE5\x9B\xBD\x21
          \u004C\u0065\u0074\u2019\u0073\u0020\u0067\u006F\u0020\u0074\u006F\u0020\u65E5\u672C\u56FD\u0021
          Let’s go to 日本国!
      

      It works! Here's the script.

      use v5.16;
      
      use Encode::Repair qw( repair_encoding );
      
      while (my $damaged_text = <DATA>) {
          chomp $damaged_text;
      
          my $repaired_text = repair_encoding(
              $damaged_text, [
                  decode => 'UTF-8',
                  encode => 'UCS-2LE',
              ]
          );
      
          say $repaired_text;
      }
      
      exit 0;
      
      __DATA__
      敒›剕䕇呎
      敌馀⁳潧琠韦겜鯥↽
      

      Brilliantly, this prints…

          Re: URGENT
          Let’s go to 日本国!
      

      Notice, however, that I had to remove binmode STDOUT, ':encoding(UTF-8)'. This version of the script also works.

      use v5.16;
      
      use Encode qw( decode );
      use Encode::Repair qw( repair_encoding );
      
      binmode STDOUT, ':encoding(UTF-8)';
      
      while (my $damaged_text = <DATA>) {
          chomp $damaged_text;
      
          my $repaired_text = repair_encoding(
              $damaged_text, [
                  decode => 'UTF-8',
                  encode => 'UCS-2LE',
              ]
          );
      
          say decode('UTF-8', $repaired_text);
      }
      
      exit 0;
      
      __DATA__
      敒›剕䕇呎
      敌馀⁳潧琠韦겜鯥↽
      

      I can study and study and study the Perl documentation, but I'll never grok the subtleties of Perl's Unicode model. It's simply too profoundly confusing for me.

      I love your module, moritz! Thanks for it, and for your help here.

Re: How to Fix Character Encoding Damaged Text Using Perl?
by BrowserUk (Patriarch) on Jun 15, 2013 at 06:56 UTC

    You often keep such text -- without a newline -- in a file on its own?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I work in digital forensics and electronic discovery. In the real world, sh*t happens. I often have to wrestle with damaged evidence of all kinds. Ignoring it is not an option. But I can't just ask the bad guys to recreate the incriminating evidence for me—this time, please, without all the nasty character encoding damaged text in it.

      So, yes, I often keep such text without a newline in a file on its own.

Re: How to Fix Character Encoding Damaged Text Using Perl?
by Anonymous Monk on Jun 15, 2013 at 02:29 UTC

    It's gibberish, not Chinese.

    Then post the BYTES not unicode codepoints

      I posted the Unicode code points and character names to help illustrate how the character encoding damage occurred; that is, to demonstrate the pattern.

      My problem is straightforward:  Using Perl, restore the damaged text '敒›剕䕇呎' to the original text 'Re: URGENT'.

      Whoops :)
      my $bytes = encode( 'UTF-8', $perlunicodestring ); $bytes =~ s{\x52\x65\x3A\x20\x55\x52\x47\x45\x4E\x54}{Re: URGENT}g

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1039058]
Approved by davido
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others lurking in the Monastery: (2)
As of 2024-03-19 07:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found