If it's really that straightforward, this kind of reverse pipeline might do:
perl -CSD -MEncode -le '$moji = decode "UCS-2LE", "Re: URGENT"; print $moji; print decode "UTF-8", encode "UCS-2LE", $moji'
敒›剕䕇呎
Re: URGENT
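To see why the round trip works, it can help to look at how the ASCII bytes pair up when they are read as 16-bit little-endian units. The following is only a minimal sketch in core Perl; the helper name show_pairs is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a byte string into 2-byte units and report
# the code point a little-endian 16-bit decoder reads from each unit.
sub show_pairs {
    my ($bytes) = @_;
    my @rows;
    for my $pair ( unpack '(a2)*', $bytes ) {
        next if length($pair) < 2;    # skip a trailing odd byte
        my ( $first, $second ) = unpack 'C2', $pair;
        my $code_point = unpack 'v', $pair;    # 'v' = 16-bit little-endian
        push @rows, sprintf '%02X %02X -> U+%04X', $first, $second, $code_point;
    }
    return @rows;
}

print "$_\n" for show_pairs('Re: URGENT');
```

For 'Re: URGENT' this lists 52 65 -> U+6552, 3A 20 -> U+203A, 55 52 -> U+5255, 47 45 -> U+4547 and 4E 54 -> U+544E, which are the five characters of the damaged text.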
| [reply] [d/l] |
Thank you very much, Your Mother!
This works wonderfully, and it makes sense given the known cause of the encoding corruption.
use v5.16;
use utf8;
use open qw( :encoding(UTF-8) :std );
use Encode qw( encode decode );
while (my $damaged_text = <DATA>) {
chomp $damaged_text;
my $repaired_text = decode('UTF-8', encode('UCS-2LE', $damaged_text));
say $repaired_text;
}
close DATA;
exit 0;
__DATA__
敒›剕䕇呎
敌馀潧琠韦겜鯥↽
This prints…
Re: URGENT
Let’s go to 日本国!
…as expected.
In this case, use utf8 is required.
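One way to sidestep the source-encoding question while experimenting is to write the damaged text as code point escapes, so the script itself is pure ASCII and does not depend on use utf8. This is just a sketch under that assumption; the sub name repair is mine, and the code points are those of the first sample.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Hypothetical helper: undo "UTF-8 bytes misread as UCS-2LE".
sub repair {
    my ($damaged) = @_;
    return decode( 'UTF-8', encode( 'UCS-2LE', $damaged ) );
}

# Damaged text written as escapes, so there is no literal non-ASCII
# character in the source file.
my $damaged = "\x{6552}\x{203A}\x{5255}\x{4547}\x{544E}";

binmode STDOUT, ':encoding(UTF-8)';
print repair($damaged), "\n";    # Re: URGENT
```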
| [reply] [d/l] |
Since you already know what sequence of encoding and decoding led to the broken output, the easiest way with Encode::Repair is this:
use 5.010;
use strict;
use warnings;
use Encode::Repair qw(repair_encoding);
my $broken = '敒›剕䕇呎';
say repair_encoding($broken, [decode => 'utf-8', encode => 'UTF-16LE']);
__END__
# output:
Re: URGENT
But it also works with learn_recoding:
use 5.010;
use strict;
use warnings;
use Encode::Repair qw(repair_encoding learn_recoding);
binmode STDOUT, ':encoding(UTF-8)';
my $broken = '敒›剕䕇呎';
my $pattern = learn_recoding(
from => $broken,
to => 'Re: URGENT',
encodings => ['UTF-8', 'UTF-16LE', 'UTF-16BE'],
);
if ($pattern) {
say repair_encoding($broken, $pattern);
}
So, what did you try?
(Updated to use pre tags instead of code, because code tags badly break most non-ASCII characters.)
| [reply] |
Thank you very much, moritz, for your helpful reply. I greatly appreciate it.
So, what did you try?
I tried variations of something very similar to your second example using Encode::Repair::learn_recoding(). In hindsight, I believe the problem that thwarted my efforts was my inclusion of use utf8 in the script. Needless to say, I thought I was doing the right thing when, in fact, I was doing exactly the wrong thing.
I just tested using Encode::Repair to repair damaged text that has non-ASCII characters in it.
敌馀潧琠韦겜鯥↽
\u654C\uE274\u9980\u2073\u6F67\u7420\u206F\u97E6\uE6A5\uAC9C\u9BE5\u21BD
\x4C\x65\x74\xE2\x80\x99\x73\x20\x67\x6F\x20\x74\x6F\x20\xE6\x97\xA5\xE6\x9C\xAC\xE5\x9B\xBD\x21
\u004C\u0065\u0074\u2019\u0073\u0020\u0067\u006F\u0020\u0074\u006F\u0020\u65E5\u672C\u56FD\u0021
Let’s go to 日本国!
It works! Here's the script.
use v5.16;
use Encode::Repair qw( repair_encoding );
while (my $damaged_text = <DATA>) {
chomp $damaged_text;
my $repaired_text = repair_encoding(
$damaged_text, [
decode => 'UTF-8',
encode => 'UCS-2LE',
]
);
say $repaired_text;
}
exit 0;
__DATA__
敒›剕䕇呎
敌馀潧琠韦겜鯥↽
Brilliantly, this prints…
Re: URGENT
Let’s go to 日本国!
Notice, however, that I had to remove binmode STDOUT, ':encoding(UTF-8)'. This version of the script also works.
use v5.16;
use Encode qw( decode );
use Encode::Repair qw( repair_encoding );
binmode STDOUT, ':encoding(UTF-8)';
while (my $damaged_text = <DATA>) {
chomp $damaged_text;
my $repaired_text = repair_encoding(
$damaged_text, [
decode => 'UTF-8',
encode => 'UCS-2LE',
]
);
say decode('UTF-8', $repaired_text);
}
exit 0;
__DATA__
敒›剕䕇呎
敌馀潧琠韦겜鯥↽
I can study and study and study the Perl documentation, but I'll never grok the subtleties of Perl's Unicode model. It's simply too profoundly confusing for me.
I love your module, moritz! Thanks for it, and for your help here.
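For anyone who wants to reproduce the code point and byte dumps earlier in this post, here is a rough sketch using only core Encode. The helper names as_code_points and as_bytes are made up, and the damaged sample is built from the code points listed in the dump, several of which are private-use or invisible and easy to lose in copy and paste.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Hypothetical helpers for dumping a string two ways.
sub as_code_points { return join '', map { sprintf '\\u%04X', ord($_) } split //, $_[0] }
sub as_bytes       { return join '', map { sprintf '\\x%02X', ord($_) } split //, $_[0] }

# Damaged sample, built from the code points in the dump.
my $damaged = join '', map { chr( hex($_) ) } qw(
    654C E274 9980 2073 6F67 7420 206F 97E6 E6A5 AC9C 9BE5 21BD
);

my $bytes    = encode( 'UCS-2LE', $damaged );    # the underlying bytes
my $repaired = decode( 'UTF-8', $bytes );        # those bytes read as UTF-8

print as_code_points($damaged),  "\n";
print as_bytes($bytes),          "\n";
print as_code_points($repaired), "\n";
```

The three printed lines correspond to the three escape dumps shown earlier in this post, in the same order.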
| [reply] [d/l] [select] |
I work in digital forensics and electronic discovery. In the real world, sh*t happens. I often have to wrestle with damaged evidence of all kinds. Ignoring it is not an option. But I can't just ask the bad guys to recreate the incriminating evidence for me, this time, please, without all the nasty encoding-damaged text in it.
So, yes, I often keep such text without a newline in a file on its own.
| [reply] |
It's gibberish, not Chinese. Then post the BYTES, not Unicode code points.
| [reply] |
I posted the Unicode code points and character names to help illustrate how the character encoding damage occurred; that is, to demonstrate the pattern.
My problem is straightforward: Using Perl, restore the damaged text '敒›剕䕇呎' to the original text 'Re: URGENT'.
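Stated as code, and assuming the damage really is "UTF-8 bytes misread as UCS-2LE", a guarded version might look like the sketch below. The sub name try_repair is hypothetical; it uses Encode's FB_CROAK check so that text which does not map back to well-formed UTF-8 comes back as undef instead of being silently mangled.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Hypothetical helper: returns the repaired text, or undef when the
# damaged text does not map back to well-formed UTF-8.
sub try_repair {
    my ($damaged) = @_;
    my $repaired = eval {
        my $bytes = encode( 'UCS-2LE', $damaged, Encode::FB_CROAK );
        decode( 'UTF-8', $bytes, Encode::FB_CROAK );
    };
    return $repaired;    # undef if either step died
}

# First sample is the damaged text from this thread, written as escapes;
# the second is an arbitrary character whose bytes (FF 80) are not UTF-8.
for my $sample ( "\x{6552}\x{203A}\x{5255}\x{4547}\x{544E}", "\x{80FF}" ) {
    my $fixed = try_repair($sample);
    print defined $fixed ? "repaired: $fixed\n" : "not repairable this way\n";
}
```

The first sample comes back as Re: URGENT; the second is reported as not repairable this way.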
| [reply] |
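use Encode qw( encode );
# The pattern is the UCS-2LE byte sequence of the damaged text, so encode to UCS-2LE (not UTF-8) for it to match.
my $bytes = encode( 'UCS-2LE', $perlunicodestring );
$bytes =~ s{\x52\x65\x3A\x20\x55\x52\x47\x45\x4E\x54}{Re: URGENT}g;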
| [reply] [d/l] |