UTF-8 to ISO-8859-1

by stew (Scribe)
on Mar 04, 2003 at 12:35 UTC ( #240311=perlquestion )
stew has asked for the wisdom of the Perl Monks concerning the following question:

Anybody got any idea how to convert a string from UTF-8 to ISO-8859-1? I've thought about using iconv but I'm not sure how to go about it.

Any advice would be warmly accepted.

Replies are listed 'Best First'.
Re: UTF-8 to ISO-8859-1
by mirod (Canon) on Mar 04, 2003 at 13:05 UTC
Re: UTF-8 to ISO-8859-1
by bart (Canon) on Mar 04, 2003 at 16:13 UTC
    On 5.6.x, it can be as simple as
    $latin1 = pack 'C*', unpack 'U*', $utf8;
    On 5.8.0 (and later), use the Encode module.
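For the 5.8+ route, a minimal sketch of what "use the Encode module" can look like. It assumes the input is a string of UTF-8 encoded bytes; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 encoded bytes for "élève" (é = C3 A9, è = C3 A8).
my $utf8_bytes = "\xC3\xA9l\xC3\xA8ve";

# Step 1: decode the UTF-8 bytes into a Perl character string.
my $chars = decode('UTF-8', $utf8_bytes);

# Step 2: encode those characters as ISO-8859-1 bytes.
# Characters that do not exist in Latin-1 are replaced with a
# substitution character by default.
my $latin1 = encode('ISO-8859-1', $chars);

# Show the resulting bytes in hex.
print join(' ', map { sprintf '%02X', ord } split //, $latin1), "\n";
# prints "E9 6C E8 76 65"
```

Encode also provides from_to($data, 'UTF-8', 'ISO-8859-1'), which does the same conversion in place on a byte string.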

    Some of the solutions proposed here only work well on pre-5.6 perls, because from 5.6.0 on, perl has built-in magic that automatically converts Latin-1 to UTF-8 (without you asking for it). Like this (on 5.6.1):

    $latin1 = 'élève'; $utf8 = chr(8801); print join ' ', $latin1, $utf8;
    élève ≡
    As you can see, the Latin-1 is converted into UTF-8. This renders a lot of the code that used to work on 5.005 and earlier useless: you can't turn UTF-8 into Latin-1, as perl will undo your replacements.

    The mechanism behind all that is that each string has a flag attached to it, much like the taint flag, indicating whether the string is in UTF-8 or in bytes. When you join byte strings to UTF-8 strings, perl converts the byte strings to UTF-8, and the resulting string is marked as UTF-8 as well. Personally I really, really hate this behaviour.
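The flag can be observed directly on newer perls. The sketch below assumes 5.8.1 or later, where utf8::is_utf8 is available; it is not something the 5.6 code above could call.

```perl
use strict;
use warnings;

my $latin1 = "\xE9l\xE8ve";   # 5 characters, all below 256: a byte string
my $wide   = chr(8801);        # U+2261, does not fit in a single byte

# Joining the two forces perl to upgrade the byte string to UTF-8 internally.
my $joined = join ' ', $latin1, $wide;

print 'latin1: ', (utf8::is_utf8($latin1) ? 'flag on' : 'flag off'), "\n";
print 'joined: ', (utf8::is_utf8($joined) ? 'flag on' : 'flag off'), "\n";
print 'length: ', length($joined), "\n";   # counted in characters, not bytes
```

The original $latin1 keeps its flag off; only the joined result is marked as UTF-8.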

    There are ways around it: in 5.6, using pack, you can turn a string to bytes or to UTF-8, without the bytes themselves being touched, effectively only setting or clearing this UTF-8 flag on the resulting string.

    $bytes = pack 'C0a*', $utf8; $utf8 = pack 'U0a*', $bytes;
    See the docs on pack for 5.6.1. Search for "C0".

    5.8 has less hackish methods built in. See utf8 and Encode.
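As a rough illustration of the less hackish 5.8 methods, here is a sketch using the always-available utf8:: functions. Treat it as one possible approach rather than an exact equivalent of the pack tricks: unlike pack with C0 or U0, these functions convert the data as well as changing the flag.

```perl
use strict;
use warnings;

my $bytes = "\xC3\xA9l\xC3\xA8ve";   # UTF-8 encoded octets, 7 bytes

# Interpret the octets as UTF-8, in place; returns false if they are not valid.
my $chars = $bytes;
utf8::decode($chars) or die "not valid UTF-8";

# Force the internal representation back to one byte per character.
# This dies if the string contains a character above 255.
my $latin1 = $chars;
utf8::downgrade($latin1);

# Go the other way: characters back to UTF-8 encoded octets, in place.
my $again = $chars;
utf8::encode($again);

print length($bytes), ' ', length($chars), ' ', length($again), "\n";
```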

      What's the solution in 5.005_03?

        Well, several solutions have been pointed to. Here's one I have used myself.
        my (%encoding, %decoding);

        # Encode one code point as UTF-8 bytes (handles up to 3-byte sequences).
        # Note that 0 falls through to the two-byte branch.
        sub UTF8::chr ($) {
            my $ord = shift;
            if ($ord && $ord < 0x80) {
                return chr $ord;    # OR: pack 'C', $ord;
            } elsif ($ord < 0x800) {
                return pack 'C2', 0xC0 | ($ord >> 6), 0x80 | ($ord & 0x3F);
            } else {
                return pack 'C3', 0xE0 | ($ord >> 12),
                                  0x80 | (($ord >> 6) & 0x3F),
                                  0x80 | ($ord & 0x3F);
            }
        }

        # initialize the lookup tables for NUL and the upper half of Latin-1
        for my $ord (0, 128 .. 255) {
            $encoding{chr $ord} = UTF8::chr($ord);
        }
        %decoding = reverse %encoding;

        sub UTF8_to_L1 {
            foreach (@_ = @_) {    # work on copies of the arguments
                s/(\000|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xFF][\x80-\xBF][\x80-\xBF])/$decoding{$1} || "(#$1#)"/ge;
            }
            return wantarray ? @_ : pop;
        }

        sub L1_to_UTF8 {
            foreach (@_ = @_) {    # work on copies of the arguments
                s/([\000\x80-\xFF])/$encoding{$1}/g;
            }
            return wantarray ? @_ : pop;
        }
        In order to make it work for 5.6 too, you need to "disarm" the UTF-8 strings in the UTF8_to_L1 sub, for example using pack('C0a*', $string)

        For completeness' sake, here's a sub to turn a UTF-8 encoded character into an ordinal:

        sub UTF8::ord ($) {
            my $chr = shift;
            unless ($chr =~ /^([\300-\377][\200-\277]+)/) {
                return ord $chr;    # 1 byte
            }
            my @ord = unpack 'C*', $1;
            if ($ord[0] & 0x20) {    # 0xE0 .. 0xFF
                return ($ord[0] & 0x1F) << 12 | ($ord[1] & 0x3F) << 6 | $ord[2] & 0x3F;
            } else {
                return ($ord[0] & 0x1F) << 6 | $ord[1] & 0x3F;
            }
        }
Re: UTF-8 to ISO-8859-1
by Thelonius (Priest) on Mar 04, 2003 at 12:52 UTC
