PerlMonks  

UTF-8 to ISO-8859-1

by stew (Scribe)
on Mar 04, 2003 at 12:35 UTC ( #240311=perlquestion )
stew has asked for the wisdom of the Perl Monks concerning the following question:

Anybody got any idea how to convert a string from UTF-8 to ISO-8859-1? I've thought about using iconv but I'm not sure how to go about it.

Any advice would be warmly accepted.

Re: UTF-8 to ISO-8859-1
by Thelonius (Curate) on Mar 04, 2003 at 12:52 UTC
Re: UTF-8 to ISO-8859-1
by mirod (Canon) on Mar 04, 2003 at 13:05 UTC
Re: UTF-8 to ISO-8859-1
by bart (Canon) on Mar 04, 2003 at 16:13 UTC
    On 5.6.x, it can be as simple as
    $latin1 = pack 'C*', unpack 'U*', $utf8;
    On 5.8.0 (and later), use the Encode module.
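For the 5.8 route, a minimal sketch with Encode might look like the following. The sample bytes and variable names are assumptions made up for illustration; `decode`, `encode` and `from_to` are the module's documented functions.

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to);

# UTF-8 encoded bytes for "élève" (assumed sample input).
my $utf8_bytes = "\xC3\xA9l\xC3\xA8ve";

# Keep a copy for the in-place variant below.
my $in_place = $utf8_bytes;

# Two-step form: UTF-8 bytes -> Perl characters -> Latin-1 bytes.
my $chars  = decode('UTF-8', $utf8_bytes);
my $latin1 = encode('ISO-8859-1', $chars);

# One-step form: converts the bytes in $in_place directly.
from_to($in_place, 'UTF-8', 'ISO-8859-1');

# Show the resulting bytes in hex.
print join(' ', map { sprintf '%02X', ord } split //, $latin1), "\n";
# prints: E9 6C E8 76 65
```

Characters that have no Latin-1 equivalent are replaced with a substitution character by default; pass a CHECK argument to `encode` if you would rather have it die.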

    Some of the solutions proposed here only work well on pre-5.6 systems, because from 5.6.0 on, perl has built-in magic that automatically converts Latin1 back to UTF-8 (without you asking for it). Like this (on 5.6.1):

    $latin1 = 'élève';    # Latin-1 bytes
    $utf8 = chr(8801);
    print join ' ', $latin1, $utf8;
    Result:
    élève ≡
    As you can see, the Latin1 is converted into UTF-8. This renders a lot of the code that used to work on 5.005 and earlier useless: you can't turn UTF-8 into Latin1, as perl will undo your replacements.

    The mechanism behind all that is that each string has a flag attached to it, much like the taint flag, indicating whether the string is in UTF-8 or in bytes. When you join byte strings to strings in UTF-8, perl will convert the byte strings to UTF-8. The end string is marked as UTF-8 as well. Personally I really really hate this behaviour.
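To watch the flag at work, here is a small sketch. It assumes perl 5.8.1 or later, where `utf8::is_utf8` is available to inspect the flag; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $latin1 = "\xE9l\xE8ve";    # byte string, UTF-8 flag off
my $wide   = chr(8801);        # U+2261, can only be stored with the flag on
my $joined = join ' ', $latin1, $wide;

# The byte string itself is untouched, but the joined result is flagged.
printf "latin1 flagged: %s\n", utf8::is_utf8($latin1) ? 'yes' : 'no';
printf "joined flagged: %s\n", utf8::is_utf8($joined) ? 'yes' : 'no';

# Length is counted in characters, not bytes: 5 + 1 + 1.
printf "joined length:  %d\n", length $joined;
```

On such a perl this should report "no", "yes" and 7: the Latin-1 part was upgraded when it was joined with the wide character.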

    There are ways around it: in 5.6, using pack, you can turn a string to bytes or to UTF-8, without the bytes themselves being touched, effectively only setting or clearing this UTF-8 flag on the resulting string.

    $bytes = pack 'C0a*', $utf8;
    $utf8 = pack 'U0a*', $bytes;
    See the docs on pack for 5.6.1. Search for "C0".

    5.8 has less hackish methods built in. See utf8 and Encode.
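As a rough illustration of the 5.8 built-ins, the sketch below uses the `utf8::decode` and `utf8::downgrade` functions (no `use utf8` needed to call them). The sample bytes and the variable name are assumptions for the example.

```perl
use strict;
use warnings;

# UTF-8 encoded bytes for "élève" (assumed sample input).
my $string = "\xC3\xA9l\xC3\xA8ve";

# Interpret the bytes as UTF-8, in place; $string now holds 5 characters.
utf8::decode($string)
    or die "input is not valid UTF-8\n";

# Switch to the byte representation, which for characters below 256 is Latin-1.
# The second argument (FAIL_OK) makes it return false instead of dying
# when a character does not fit.
utf8::downgrade($string, 1)
    or die "string contains characters outside Latin-1\n";

printf "%d bytes, flag %s\n", length $string,
    utf8::is_utf8($string) ? 'on' : 'off';
```

Encode gives more control over error handling and other encodings; these functions only cover the UTF-8 and Latin-1 case.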

      What's the solution in 5.005_03?

      ------
      We are the carpenters and bricklayers of the Information Age.

      Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

      Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

        Well, several solutions have been pointed to. Here's one I have used myself.
        my (%encoding, %decoding);

        # Encode one code point (up to 0xFFFF) as a UTF-8 byte string.
        # NUL is deliberately written as the two-byte form C0 80.
        sub UTF8::chr ($) {
            my $ord = shift;
            if ($ord && $ord < 0x80) {
                return chr $ord;    # OR: pack 'C', $ord;
            } elsif ($ord < 0x800) {
                return pack 'C2', 0xC0 | ($ord >> 6), 0x80 | ($ord & 0x3F);
            } else {
                return pack 'C3', 0xE0 | ($ord >> 12),
                                  0x80 | (($ord >> 6) & 0x3F),
                                  0x80 | ($ord & 0x3F);
            }
        }

        # initialize: NUL plus the upper half of Latin-1
        for my $ord (0, 128 .. 255) {
            $encoding{chr $ord} = UTF8::chr($ord);
        }
        %decoding = reverse %encoding;

        sub UTF8_to_L1 {
            foreach (@_ = @_) {    # work on copies, not the caller's variables
                # Sequences with no Latin-1 equivalent are kept, marked as (#...#)
                s/(\000|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xFF][\x80-\xBF][\x80-\xBF])/$decoding{$1} || "(#$1#)"/ge;
            }
            return wantarray ? @_ : pop;
        }

        sub L1_to_UTF8 {
            foreach (@_ = @_) {
                s/([\000\x80-\xFF])/$encoding{$1}/g;
            }
            return wantarray ? @_ : pop;
        }
        In order to make it work for 5.6 too, you need to "disarm" the UTF-8 strings in the UTF8_to_L1 sub, for example using pack('C0a*', $string).

        For completeness' sake, here's a sub to turn a UTF-8 encoded character into an ordinal:

        sub UTF8::ord ($) {
            my $chr = shift;
            unless ($chr =~ /^([\300-\377][\200-\277]+)/) {
                return ord $chr;    # 1 byte
            }
            my @ord = unpack 'C*', $1;
            if ($ord[0] & 0x20) {    # 0xE0 .. 0xFF
                return ($ord[0] & 0x1F) << 12 | ($ord[1] & 0x3F) << 6 | $ord[2] & 0x3F;
            } else {
                return ($ord[0] & 0x1F) << 6 | $ord[1] & 0x3F;
            }
        }
