http://www.perlmonks.org?node_id=977251

sumeetgrover has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have tried looking for some information about converting UTF8 data into Latin1 and could not find exactly what I was looking for.

Background: I have a UTF8 RSS feed supplied to us by our clients. I have parsed the RSS data into appropriate data structures.

Now, the problem is that all our database and website pages only support Latin1 / ISO-8859-1 encoding. I would like to encode the UTF8 data into Latin1, but the converted output contains extra or junk characters.

QUESTION:
1. Is it possible to convert UTF8 data into Latin1?
2. If it is possible, how to do it?

I have tried using the encode("iso-8859-1",$string) function to encode UTF8 data into Latin1, but as I mentioned, it introduces extra or junk characters.

Any suggestions?

Re: Converting UTF8 to Latin1
by Corion (Patriarch) on Jun 20, 2012 at 08:32 UTC

    As UTF-8 contains lots and lots of characters, and Latin-1 only contains 256 different characters, some of which aren't considered "printable", you cannot do a lossless conversion from UTF-8 down to Latin-1. Using Encode::encode should still Just Work, and either die or replace the unmappable characters with "?". But without any code, it's hard to tell where your attempt goes wrong.
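The two behaviours described above can be sketched with the core Encode module alone. This is a minimal illustration, not the original poster's code: the sample string and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw( encode );

# A decoded (character) string. U+00E9 (e with acute) exists in Latin-1,
# U+2660 (black spade suit) does not.
my $text = "caf\x{E9} \x{2660}";

# Default behaviour: each unmappable character becomes the substitution
# character, which is "?" for iso-8859-1.
my $lossy = encode('iso-8859-1', $text);

# With FB_CROAK, encode dies on the first unmappable character instead.
# A copy is passed because encode may modify its input when CHECK is set.
my $copy = $text;
my $died = !eval { encode('iso-8859-1', $copy, Encode::FB_CROAK); 1 };

printf "lossy result is %d bytes; strict mode %s\n",
    length($lossy), $died ? "died" : "succeeded";
```

Which behaviour is preferable depends on whether silently losing characters is acceptable for the data in question.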

    If you want to map arbitrary Unicode to ASCII, there also is Text::Unidecode. But it will map for example \N{LATIN SMALL LETTER A WITH DIAERESIS} ("ä") to "a", even though "ä" also exists in Latin-1.

Re: Converting UTF8 to Latin1
by grantm (Parson) on Jun 20, 2012 at 10:16 UTC
    our database and website pages only support Latin1

    As Corion explained, what you've asked for can't be done since the set of possible characters in UTF8 is much bigger than the set of possible characters in Latin-1.

    HTML does allow you to represent any character regardless of the encoding. For example, the character 'Ā' is not in Latin-1 but you can include it in HTML with &#256;. However that would only work if you stored HTML encoded text in your database - which would be an odd thing to do.

    By far the best answer is to update your database and web pages to UTF8.

Re: Converting UTF8 to Latin1
by Neighbour (Friar) on Jun 20, 2012 at 08:34 UTC
    Close, it's
    Encode::encode($encoding_out, Encode::decode($encoding_in, $data));
    Where $encoding_in and $encoding_out contain the encoding of your choice (in your case UTF8 and Latin1 (or iso-8859-1) respectively).

    Edit: Also, what Corion said :)
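One plausible cause of the junk characters in the question is calling encode on bytes that were never decoded. The sketch below contrasts that with the decode-then-encode form given above; the sample bytes and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Raw UTF-8 bytes as they might arrive from a feed: "cafe" with an acute
# accent on the e, which UTF-8 stores as the two bytes 0xC3 0xA9.
my $utf8_bytes = "caf\xC3\xA9";

# Wrong: encoding the undecoded bytes treats 0xC3 and 0xA9 as two separate
# characters, so both end up in the output as junk.
my $junk = encode('iso-8859-1', $utf8_bytes);

# Right: decode the UTF-8 bytes into characters first, then encode as Latin-1.
my $latin1 = encode('iso-8859-1', decode('UTF-8', $utf8_bytes));

printf "%d bytes without decode, %d bytes with decode\n",
    length($junk), length($latin1);
```

Viewed as Latin-1, the five-byte result shows two junk characters where the single accented letter should be, while the four-byte result contains the single byte 0xE9.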
Re: Converting UTF8 to Latin1
by Anonymous Monk on Jun 20, 2012 at 09:04 UTC
Re: Converting UTF8 to Latin1
by ikegami (Patriarch) on Jun 20, 2012 at 17:27 UTC

    The iso-latin-1 character set is much smaller than the Unicode character set. Fortunately, HTML provides a mechanism to encode characters not present in the character set in use.

    Option 1 (Works with any encoding):

    my $to_entitise = q{<>&"'};   # Unsafe for HTML
    my $decoded_text = decode('UTF-8', $utf8_text);
    my $decoded_html = encode_entities($decoded_text, $to_entitise);
    my $latin1_html = encode('iso-latin-1', $decoded_html, Encode::FB_HTMLCREF);

    Option 2 (Leverages our knowledge of iso-latin-1):

    my $to_entitise = q{<>&"'} .              # Unsafe for HTML
                      q{\x{100}-\x{1FFFFF}};  # Not present in iso-8859-1
    my $decoded_text = decode('UTF-8', $utf8_text);
    my $decoded_html = encode_entities($decoded_text, $to_entitise);
    my $latin1_html = encode('iso-latin-1', $decoded_html);

    Option 3 (Works with any encoding by using HTML entities for more than needed):

    my $decoded_text = decode('UTF-8', $utf8_text);
    my $decoded_html = encode_entities($decoded_text);
    my $latin1_html = encode('iso-latin-1', $decoded_html);

    If your UTF-8 encoded data is HTML rather than text, you can use:

    my $to_entitise = q{\x{100}-\x{1FFFFF}};  # Not present in iso-8859-1
    my $decoded_html = decode('UTF-8', $utf8_html);
    $decoded_html = encode_entities($decoded_html, $to_entitise);
    my $latin1_html = encode('iso-latin-1', $decoded_html);

    Common headers and test data:

    use charnames qw( :full );  # For \N on older Perls
    use Encode qw( encode decode );
    use HTML::Entities qw( encode_entities );

    my $utf8_text = encode('UTF-8', "a\N{U+00E9}\N{U+2660} 1<4");
    my $utf8_html = encode('UTF-8', "a\N{U+00E9}\N{U+2660} <b>foo</b>");

    Update: Small fixes to bugs found during testing.
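To see what the Encode::FB_HTMLCREF fallback from Option 1 does on its own, here is a reduced sketch that uses only the core Encode module. It deliberately leaves out the encode_entities step, so it does not escape <, > or &; the shortened test string is an assumption chosen for the example.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# UTF-8 bytes for "a", e with acute (U+00E9) and black spade suit (U+2660).
my $utf8_text = encode('UTF-8', "a\x{E9}\x{2660}");

# Bytes to characters.
my $decoded = decode('UTF-8', $utf8_text);

# FB_HTMLCREF replaces characters missing from the target encoding with
# decimal numeric character references. Characters that Latin-1 does have
# are written as ordinary Latin-1 bytes.
my $latin1_html = encode('iso-8859-1', $decoded, Encode::FB_HTMLCREF);

print length($latin1_html), " bytes\n";
```

Here the spade suit, which has no Latin-1 byte, comes out as the reference &#9824; while the accented e stays a single byte. For text that may contain markup-significant characters, the encode_entities call shown in Option 1 is still needed.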