PerlMonks  

Re: How to Fix Character Encoding Damaged Text Using Perl?

by moritz (Cardinal)
on Jun 15, 2013 at 07:49 UTC (#1039081)


in reply to How to Fix Character Encoding Damaged Text Using Perl?

Since you already know which sequence of encoding and decoding steps led to the broken output, the easiest way to fix it with Encode::Repair is this:

use 5.010;
use strict;
use warnings;

use Encode::Repair qw(repair_encoding);
my $broken = '敒›剕䕇呎';
say repair_encoding($broken, [decode => 'utf-8', encode => 'UTF-16LE']);
__END__
# output:
Re: URGENT
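
To make the idea behind that repair concrete, here is a minimal sketch that uses only the core Encode module. It is not a description of how Encode::Repair is implemented. It assumes the damaged text is already a character string (written here with \x{...} escapes so the encoding of the source file does not matter) and simply reverses the damage: the characters are turned back into the UTF-16LE bytes they were wrongly decoded from, and those bytes are then decoded as what they really were, UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The damaged text as characters: U+6552 U+203A U+5255 U+4547 U+544E
my $broken = "\x{6552}\x{203A}\x{5255}\x{4547}\x{544E}";

# Step 1: recover the raw bytes that were misread as UTF-16LE
my $bytes = encode('UTF-16LE', $broken);

# Step 2: decode those bytes with the encoding they were really in
my $fixed = decode('UTF-8', $bytes);

binmode STDOUT, ':encoding(UTF-8)';
print "$fixed\n";    # prints: Re: URGENT
```

Each damaged character carries two of the original bytes, low byte first, which is why five characters turn back into ten ASCII characters.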

But it also works with learn_recoding:

use 5.010;
use strict;
use warnings;
use Encode::Repair qw(repair_encoding learn_recoding);
binmode STDOUT, ':encoding(UTF-8)';
my $broken = '敒›剕䕇呎';

my $pattern = learn_recoding(
    from    => $broken,
    to      => 'Re: URGENT',
    encodings => ['UTF-8', 'UTF-16LE', 'UTF-16BE'],
);

if ($pattern) {
    say repair_encoding($broken, $pattern);
}
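
If it helps to see what "learning" a recoding can mean, the following is a rough brute-force sketch using only core Encode. The helper find_round_trip is hypothetical and is not part of Encode::Repair, and it is not meant to mirror what learn_recoding does internally. It works on a character string, tries every encode-then-decode pair from a list of candidate encodings, and returns the first pair that turns the damaged text into the expected text.

```perl
use strict;
use warnings;
use Encode qw(encode decode FB_CROAK LEAVE_SRC);

# Hypothetical helper (not part of Encode::Repair): try each
# encode/decode pair and return the first that maps $from to $to.
sub find_round_trip {
    my ($from, $to, @encodings) = @_;
    for my $enc (@encodings) {
        for my $dec (@encodings) {
            # FB_CROAK makes invalid conversions die, so wrap them in eval
            my $candidate = eval {
                my $bytes = encode($enc, $from, FB_CROAK | LEAVE_SRC);
                decode($dec, $bytes, FB_CROAK);
            };
            next unless defined $candidate;
            return ($enc, $dec) if $candidate eq $to;
        }
    }
    return;    # no pair matched
}

my $broken = "\x{6552}\x{203A}\x{5255}\x{4547}\x{544E}";
my ($enc, $dec) = find_round_trip(
    $broken, 'Re: URGENT', 'UTF-8', 'UTF-16LE', 'UTF-16BE',
);

print "encode as $enc, then decode as $dec\n" if defined $enc;
```

For the sample string this should find the pair UTF-16LE then UTF-8. A real search would also have to handle longer chains of steps, which is where a module like Encode::Repair earns its keep.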

So, what did you try?

(Updated to use pre tags instead of code tags, because code tags badly break most non-ASCII characters.)


Re^2: How to Fix Character Encoding Damaged Text Using Perl?
by Jim (Curate) on Jun 15, 2013 at 18:37 UTC

    Thank you very much, moritz, for your helpful reply. I greatly appreciate it.

    So, what did you try?

    I tried variations of something very similar to your second example using Encode::Repair::learn_recoding(). In hindsight, I believe the problem that thwarted my efforts was my inclusion of use utf8 in the script. Needless to say, I thought I was doing the right thing when, in fact, I was doing exactly the wrong thing.

    I just tested using Encode::Repair to repair damaged text that has non-ASCII characters in it.

        敌馀⁳潧琠韦겜鯥↽
        \u654C\uE274\u9980\u2073\u6F67\u7420\u206F\u97E6\uE6A5\uAC9C\u9BE5\u21BD
        \x4C\x65\x74\xE2\x80\x99\x73\x20\x67\x6F\x20\x74\x6F\x20\xE6\x97\xA5\xE6\x9C\xAC\xE5\x9B\xBD\x21
        \u004C\u0065\u0074\u2019\u0073\u0020\u0067\u006F\u0020\u0074\u006F\u0020\u65E5\u672C\u56FD\u0021
        Let’s go to 日本国!
    

    It works! Here's the script.

    use v5.16;
    
    use Encode::Repair qw( repair_encoding );
    
    while (my $damaged_text = <DATA>) {
        chomp $damaged_text;
    
        my $repaired_text = repair_encoding(
            $damaged_text, [
                decode => 'UTF-8',
                encode => 'UCS-2LE',
            ]
        );
    
        say $repaired_text;
    }
    
    exit 0;
    
    __DATA__
    敒›剕䕇呎
    敌馀⁳潧琠韦겜鯥↽
    

    Brilliantly, this prints…

        Re: URGENT
        Let’s go to 日本国!
    

    Notice, however, that I had to remove binmode STDOUT, ':encoding(UTF-8)'. The following version, which keeps the binmode call and decodes the repaired text before printing it, also works.

    use v5.16;
    
    use Encode qw( decode );
    use Encode::Repair qw( repair_encoding );
    
    binmode STDOUT, ':encoding(UTF-8)';
    
    while (my $damaged_text = <DATA>) {
        chomp $damaged_text;
    
        my $repaired_text = repair_encoding(
            $damaged_text, [
                decode => 'UTF-8',
                encode => 'UCS-2LE',
            ]
        );
    
        say decode('UTF-8', $repaired_text);
    }
    
    exit 0;
    
    __DATA__
    敒›剕䕇呎
    敌馀⁳潧琠韦겜鯥↽
    
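
    A likely explanation for the binmode behaviour, offered as a sketch rather than a statement about the module's internals: because the last step of the pattern is an encode, the repaired text comes back as UTF-8 bytes, not characters. Printing bytes through an :encoding(UTF-8) layer encodes them a second time. The snippet below uses only core Encode and simulates that second encoding by calling encode again on the bytes.

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode decode);

    # The intended text as characters (U+2019 apostrophe, three CJK characters)
    my $text = "Let\x{2019}s go to \x{65E5}\x{672C}\x{56FD}!";

    # A UTF-8 byte string, like the one the repair step appears to return
    my $bytes = encode('UTF-8', $text);

    # What an :encoding(UTF-8) output layer would do to those bytes
    my $double = encode('UTF-8', $bytes);

    printf "characters:           %d\n", length $text;      # 16
    printf "UTF-8 bytes:          %d\n", length $bytes;     # 24
    printf "double-encoded bytes: %d\n", length $double;    # 36

    # Decoding first, as the second script does, gives characters back
    print "round trip ok\n" if decode('UTF-8', $bytes) eq $text;
    ```

    So either print the bytes with no encoding layer on STDOUT, or decode them to characters and let the layer encode them once.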

    I can study and study and study the Perl documentation, but I'll never grok the subtleties of Perl's Unicode model. It's simply too profoundly confusing for me.

    I love your module, moritz! Thanks for it, and for your help here.
