PerlMonks  

Mixed ISO-8859/UTF-8 conversion

by olli (Initiate)
on Oct 04, 2007 at 10:14 UTC (#642617, snippet)

Description: I had a problem with an application that produced horribly mixed XML output, partly UTF-8 and partly ISO-8859 encoded. I found this way to transform it to pure UTF-8 without double-encoding the UTF-8 sequences that were already there. I know this will not work in all cases, but it has been helpful. What do you think?
#!/usr/bin/perl

use strict;

# mixed string with ISO 8859-1 and UTF-8
# (assumes this file is saved as ISO 8859-1, so the literals below are single bytes):
my $test_string = "Das Å (auch \"bolle-Å\" genannt, was soviel bedeutet wie \"Kringel-Å\") ist mit der ".
    force_utf8("dänischen Rechtschreibreform von 1948 eingeführt worden.");

print "Source: $test_string\n";
print "UTF   : ".force_utf8($test_string)."\n";
print "ISO   : ".force_latin($test_string)."\n";

sub force_utf8 {
    my $string = shift;
    
    $string =~ s/([\xc0-\xdf][\x80-\xbf]{1}|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}|[\x80-\xff])/encode_char_utf8($1)/ge;
    
    return $string;
    
}

sub force_latin {
    my $string = shift;
    
    $string =~ s/([\xc0-\xdf][\x80-\xbf]{1}|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}|[\x80-\xff])/decode_char_utf8($1)/ge;
    
    return $string;
    
}
     
sub encode_char_utf8 {
    my $char = shift;

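    # Already shaped like a UTF-8 multi-byte sequence: pass it through unchanged.
    if($char =~ /^([\xc0-\xdf][\x80-\xbf]{1}|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})$/) {
        return $char;
    }

    # Otherwise it is a single ISO 8859-1 byte (0x80-0xff): emit its two-byte UTF-8 form.
    my $value = ord($char);
    
    return chr(($value>>6) | 0xc0).chr(0x80 | ($value & 0x3f));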
}

sub decode_char_utf8 {
    my $char = shift;
        
    # Three- and four-byte sequences encode code points that ISO 8859-1 cannot represent: drop them.
    if($char =~ /^([\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})$/) {
        return '';
    } elsif($char =~ /^([\xc0-\xdf])([\x80-\xbf])$/) {
    
        my $value = ((ord($1) & 0x1f)<<6)+(ord($2) & 0x3f);
    
        if($value<256) {
            return chr($value);
        } else {
            return '';
        }
        
    } else {
        return $char;
    }
    
}
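
To make the double-encoding mentioned in the description concrete, here is a minimal sketch that uses the core Encode module instead of the hand-written routines above. The sample bytes in $mixed are an assumption chosen for illustration, not data from the original application.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample: "Å" as one ISO 8859-1 byte (C5), a space,
# then "ä" as a UTF-8 sequence (C3 A4).
my $mixed = "\xC5 \xC3\xA4";

# Naive conversion: treat every byte as ISO 8859-1 and re-encode.
# The existing UTF-8 sequence C3 A4 gets encoded a second time.
my $naive = encode("UTF-8", decode("ISO-8859-1", $mixed));

# Print the result as hex bytes so the damage is visible.
print join(" ", map { sprintf("%02X", ord($_)) } split(//, $naive)), "\n";
```

The lone C5 byte is converted correctly to C3 85, but the bytes C3 A4 come out as C3 83 C2 A4. That is the double-encoding that force_utf8 tries to avoid by leaving anything that already looks like a UTF-8 sequence alone.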
Re: Mixed ISO-8859/UTF-8 conversion
by zby (Vicar) on Oct 04, 2007 at 10:52 UTC
    The horror ... the horror ...

      The jerk ... the jerk ...

      You could at least explain /why/ you think it is horrific, and perhaps even include suggestions for a better solution.

        That was more about the situation that the OP has to face than about the solution :)
Re: Mixed ISO-8859/UTF-8 conversion
by Juerd (Abbot) on Oct 04, 2007 at 11:04 UTC

    Please use Perl's built in support for encodings. Read perlunitut to find out all about Perl's Unicode strings.

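    use Encode qw(decode);

    sub decode_utf8_latin1 {
        my ($encoded) = @_;
        my $decoded = '';
        while (length $encoded) {
            # Decode the valid UTF-8 prefix; FB_QUIET removes it from $encoded.
            $decoded .= decode("UTF-8", $encoded, Encode::FB_QUIET);
            # Whatever remains starts with a byte that is not valid UTF-8: take it as Latin-1.
            $decoded .= substr($encoded, 0, 1, "") if length $encoded;
        }
        return $decoded;
    }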

    If CHECK = coderef worked, you could just write decode("UTF-8", $encoded, sub { chr shift }), but alas, it doesn't work.

    Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }
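
As a usage illustration of the Encode-based approach in this reply, the following sketch decodes a mixed byte string and re-encodes it as pure UTF-8. The sub is reproduced so the example is self-contained; the sample bytes in $mixed and the variable $clean are assumptions made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode a byte string that mixes UTF-8 sequences with stray Latin-1 bytes.
sub decode_utf8_latin1 {
    my ($encoded) = @_;
    my $decoded = '';
    while (length $encoded) {
        # FB_QUIET decodes the valid UTF-8 prefix and removes it from $encoded.
        $decoded .= decode("UTF-8", $encoded, Encode::FB_QUIET);
        # The next byte, if any, is not valid UTF-8: treat it as Latin-1.
        $decoded .= substr($encoded, 0, 1, "") if length $encoded;
    }
    return $decoded;
}

# Hypothetical sample: Latin-1 "Å" (C5), a space, then UTF-8 "ä" (C3 A4).
my $mixed = "\xC5 \xC3\xA4";

# Decode to a Perl character string, then encode once as UTF-8.
my $clean = encode("UTF-8", decode_utf8_latin1($mixed));

print join(" ", map { sprintf("%02X", ord($_)) } split(//, $clean)), "\n";
```

For this input the output bytes should be C3 85 20 C3 A4: the lone Latin-1 byte is converted, and the sequence that was already UTF-8 is left as it was. The same caveat as for the original snippet applies: Latin-1 text that happens to form a valid UTF-8 sequence will be read as UTF-8.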

      Wow. It's a bit hard to understand. I didn't expect Encode::FB_QUIET to change $encoded. But that's a better way, thanks a lot.

        I didn't expect Encode::FB_QUIET to change $encoded.

        All I can say is RTFM ;-)
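
For readers who were equally surprised, here is a small sketch of the in-place behaviour. It relies on the documented Encode rule that, when a CHECK value is given, decode removes the processed part from its input unless LEAVE_SRC is set. The sample bytes and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: valid UTF-8 "ä" (C3 A4), a lone Latin-1 byte (C5), then "x".
my $bytes = "\xC3\xA4\xC5x";

# With FB_QUIET, decode stops at the first malformed byte and leaves
# only the unprocessed remainder in $bytes.
my $prefix = decode("UTF-8", $bytes, Encode::FB_QUIET());

printf "decoded %d character(s), %d byte(s) left\n", length($prefix), length($bytes);

# Adding LEAVE_SRC asks decode not to modify its input.
my $copy = "\xC3\xA4\xC5x";
my $same = decode("UTF-8", $copy, Encode::FB_QUIET() | Encode::LEAVE_SRC());

printf "with LEAVE_SRC: %d byte(s) left\n", length($copy);
```

The first call should report one decoded character with two bytes left over, which is what lets the loop in the reply above pick up the offending byte with substr. With LEAVE_SRC the input keeps all four bytes.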
