Latin-1 (ISO-8859-1) is a subset of Unicode. UTF-8 is an algorithmic transform of Unicode, which maps code points above 127 to multiple bytes. See RFC 2279 or the Unicode site for details.
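To make the "multiple bytes" point concrete, here is a small sketch that prints the UTF-8 byte sequence for a few code points. The helper name utf8_hex is made up for this illustration; utf8::encode is a Perl built-in.

```perl
use strict;
use warnings;

# Hypothetical helper: return the UTF-8 bytes of one code point as a hex string.
sub utf8_hex {
    my ($codepoint) = @_;
    my $bytes = pack("U", $codepoint);  # one-character string
    utf8::encode($bytes);               # convert in place to its UTF-8 bytes
    return unpack("H*", $bytes);
}

# ASCII stays one byte; Latin-1 characters above 127 take two bytes;
# U+201C (left double quotation mark) takes three.
printf "U+%04X -> %s\n", $_, utf8_hex($_) for 0x41, 0xE9, 0x201C;
# U+0041 -> 41
# U+00E9 -> c3a9
# U+201C -> e2809c
```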
If you know that your characters are all from the Latin-1 character set (but in the UTF-8 encoding), you can just do this: pack "C*", unpack "U*", $_
This maps directly to Latin-1. But for other character sets, you'll need table-driven mappings. There are modules that do this. See the Unicode::Map and similar modules on CPAN.
Here is a quickie which just handles Windows-1252:
#!perl -w
use strict;
my %unicode2win1252 = (
0x0152 => 0x8C, 0x0153 => 0x9C, 0x0160 => 0x8A, 0x0161 => 0x9A,
0x0178 => 0x9F, 0x017D => 0x8E, 0x017E => 0x9E, 0x0192 => 0x83,
0x02C6 => 0x88, 0x02DC => 0x98, 0x2013 => 0x96, 0x2014 => 0x97,
0x2018 => 0x91, 0x2019 => 0x92, 0x201A => 0x82, 0x201C => 0x93,
0x201D => 0x94, 0x201E => 0x84, 0x2020 => 0x86, 0x2021 => 0x87,
0x2022 => 0x95, 0x2026 => 0x85, 0x2030 => 0x89, 0x2039 => 0x8B,
0x203A => 0x9B, 0x20AC => 0x80, 0x2122 => 0x99,
);
sub simplemap {
    my ($map, $str) = @_;
    # Replace each code point found in the map; pass everything else through.
    pack "C*", map { $$map{$_}||$_ } unpack "U*", $str
}
my $a = "This is a " . pack("U*", 0x201c) . "test" . pack("U*", 0x201d)
    . " Okay, Jos" . pack("U*", 0xe9) . "?" . pack("U*", 0xfeff);
# The last character U+FEFF is not in Windows-1252 and is thrown in
# as an example of what happens to other characters.
my $b = simplemap(\%unicode2win1252, $a);
my $c = unpack("H*", $b);
print "a = $a\nb = $b\nc = $c\n";
There are C and Java conversion routines at the ICU project. I derived the hash %unicode2win1252 from the data file data/ibm-5348.ucm. See data/convrtrs.txt for the names of the character sets.
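For comparison with the hand-built table, here is a minimal sketch of the same conversion using the Encode module, which ships with Perl 5.8 and later. This is an alternative to the script above, not a description of it; the helper name utf8_to_cp1252 is made up for this example, and it assumes the input is a string of raw UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: UTF-8 bytes in, Windows-1252 bytes out.
sub utf8_to_cp1252 {
    my ($utf8_bytes) = @_;
    my $chars = decode('UTF-8', $utf8_bytes);  # bytes -> Perl characters
    return encode('cp1252', $chars);           # characters -> cp1252 bytes
}

# UTF-8 bytes for: U+201C, "test", U+201D, " Jos", U+00E9
my $in  = "\xE2\x80\x9Ctest\xE2\x80\x9D Jos\xC3\xA9";
my $out = utf8_to_cp1252($in);
print unpack("H*", $out), "\n";   # 937465737494204a6f73e9
```

Swapping 'cp1252' for 'iso-8859-1' gives the Latin-1 case. With the default settings, Encode substitutes a replacement character for anything the target encoding cannot represent, so check the output if the input may contain such characters.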