Converting UTF8 to Latin1

sumeetgrover has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have tried looking for some information about converting UTF8 data into Latin1 and could not find exactly what I was looking for.

Background: I have a UTF8 RSS feed supplied to us by our clients. I have parsed the RSS data into appropriate data structures.

Now, the problem is that our database and all our website pages only support Latin1 / ISO-8859-1 encoding. I would like to convert the UTF8 data to Latin1, but the converted output contains extra or junk characters.

QUESTION:
1. Is it possible to convert UTF8 data into Latin1?
2. If it is possible, how to do it?

I have tried using the encode("iso-8859-1",$string) function to encode UTF8 data into Latin1, but as I mentioned, it introduces extra or junk characters.

Any suggestions?


Replies are listed 'Best First'.
Re: Converting UTF8 to Latin1 by Corion (Patriarch) on Jun 20, 2012 at 08:32 UTC
As UTF-8 contains lots and lots of characters, and Latin-1 only contains 256 different characters, some of which aren't considered "printable", you cannot do a lossless conversion from UTF-8 down to Latin-1. Using Encode::encode should still Just Work, and either die or replace the unmappable characters with "?". But without any code, it's hard to tell where your attempt goes wrong. If you want to map arbitrary Unicode to ASCII, there is also Text::Unidecode. But it will map, for example, `\N{LATIN SMALL LETTER A WITH DIAERESIS}` ("ä") to "a", even though "ä" also exists in Latin-1.
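A minimal sketch of the two behaviours mentioned above (substitute or die), assuming the input is already a decoded character string. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical sample: "é" (U+00E9) exists in Latin-1, the spade (U+2660) does not.
my $chars = "caf\x{E9} \x{2660}";

# Default behaviour: each unmappable character becomes the substitution character "?".
my $lossy = encode('iso-8859-1', $chars);

# With FB_CROAK, encode dies on the first unmappable character instead.
# LEAVE_SRC keeps encode from modifying $chars in place.
my $strict = eval {
    encode('iso-8859-1', $chars, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $died = defined $strict ? 0 : 1;
```

Which behaviour is preferable depends on whether silent data loss is acceptable for the feed in question.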
Re: Converting UTF8 to Latin1 by grantm (Parson) on Jun 20, 2012 at 10:16 UTC
"our database and website pages only support Latin1"
As Corion explained, what you've asked for can't be done since the set of possible characters in UTF8 is much bigger than the set of possible characters in Latin-1. HTML does allow you to represent any character regardless of the encoding. For example, the character 'Ā' is not in Latin-1 but you can include it in HTML with `&#256;`. However that would only work if you stored HTML encoded text in your database - which would be an odd thing to do. By far the best answer is to update your database and web pages to UTF8.
Re: Converting UTF8 to Latin1 by Neighbour (Friar) on Jun 20, 2012 at 08:34 UTC
Close, it's `Encode::encode($encoding_out, Encode::decode($encoding_in, $data));` where `$encoding_in` and `$encoding_out` contain the encoding of your choice (in your case `UTF8` and `Latin1` (or `iso-8859-1`) respectively). Edit: Also, what Corion said :)
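As a hedged illustration of why the decode step matters: the original poster's code was not shown, so the "junk" variant below is only one plausible cause of the extra characters, and the sample octets are invented.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Hypothetical input: "café" as UTF-8 octets ("é" is 0xC3 0xA9), as read from a feed.
my $utf8_octets = "caf\xC3\xA9";

# Encoding without decoding first treats each octet as a character, so the
# two UTF-8 octets pass through unchanged and display as "Ã©" in Latin-1.
my $junk = encode('iso-8859-1', $utf8_octets);

# Decoding first yields the character U+00E9, which encodes to the single octet 0xE9.
my $latin1 = encode('iso-8859-1', decode('UTF-8', $utf8_octets));
```

Here `$junk` is five octets long while `$latin1` is four, which matches the symptom of extra characters appearing after conversion.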
Re^2: Converting UTF8 to Latin1 by moritz (Cardinal) on Jun 20, 2012 at 10:21 UTC
If you combine `encode` and `decode` this way, you can just as well use `Encode::from_to` directly. See the Encode docs. Though of course the same caveats about the covered character set still apply.
Perl 6 - the future is here, just unevenly distributed
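A short sketch of the `Encode::from_to` variant, assuming the data is held as UTF-8 octets in a scalar. The variable names and sample data are illustrative only.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical input: "café" as UTF-8 octets.
my $octets = "caf\xC3\xA9";

# from_to converts the buffer in place and returns the new length in octets,
# or undef on failure.
my $length = Encode::from_to($octets, 'UTF-8', 'iso-8859-1');
```

Note that `from_to` overwrites its first argument, so copy the scalar first if the original octets are still needed.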
Re: Converting UTF8 to Latin1 by Anonymous Monk on Jun 20, 2012 at 09:04 UTC
Read perlunitut: Unicode in Perl#I/O flow (the actual 5 minute tutorial) and learn the magic incantation: decode input, encode output. Is there a way to automatically decode or encode? Why yes, use I/O encoding layers :)
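The "decode input, encode output" flow with encoding layers might look like the sketch below. It uses in-memory filehandles so it is self-contained; real code would presumably open the feed and the output destination instead, and the sample data is invented.

```perl
use strict;
use warnings;

# Hypothetical feed content: "café\n" as UTF-8 octets.
my $utf8_octets = "caf\xC3\xA9\n";

# Decode on input: the :encoding(UTF-8) layer turns octets into characters.
open my $in, '<:encoding(UTF-8)', \$utf8_octets or die "open input: $!";
my $line = <$in>;
close $in;

# Encode on output: the :encoding(iso-8859-1) layer turns characters
# back into Latin-1 octets as they are written.
my $latin1_octets = '';
open my $out, '>:encoding(iso-8859-1)', \$latin1_octets or die "open output: $!";
print {$out} $line;
close $out;
```

With the layers in place, the code between reading and writing only ever sees character strings, so no explicit `decode` or `encode` calls are needed.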
Re: Converting UTF8 to Latin1 by ikegami (Patriarch) on Jun 20, 2012 at 17:27 UTC
The iso-latin-1 character set is much smaller than the Unicode character set. Fortunately, HTML provides a mechanism to encode characters not present in the character set in use.

Option 1 (works with any encoding):

    my $to_entitise = q{<>&"'};   # Unsafe for HTML

    my $decoded_text = decode('UTF-8', $utf8_text);
    my $decoded_html = encode_entities($decoded_text, $to_entitise);
    my $latin1_html  = encode('iso-latin-1', $decoded_html, Encode::FB_HTMLCREF);

Option 2 (leverages our knowledge of iso-latin-1):

    my $to_entitise = q{<>&"'} .              # Unsafe for HTML
                      q{\x{100}-\x{1FFFFF}};  # Not present in iso-8859-1

    my $decoded_text = decode('UTF-8', $utf8_text);
    my $decoded_html = encode_entities($decoded_text, $to_entitise);
    my $latin1_html  = encode('iso-latin-1', $decoded_html);

Option 3 (works with any encoding by using HTML entities for more than needed):

    my $decoded_text = decode('UTF-8', $utf8_text);
    my $decoded_html = encode_entities($decoded_text);
    my $latin1_html  = encode('iso-latin-1', $decoded_html);

If your UTF-8 encoded data is HTML rather than text, you can use:

    my $to_entitise = q{\x{100}-\x{1FFFFF}};  # Not present in iso-8859-1

    my $decoded_html = decode('UTF-8', $utf8_html);
    $decoded_html    = encode_entities($decoded_html, $to_entitise);
    my $latin1_html  = encode('iso-latin-1', $decoded_html);

Common headers and test data:

    use charnames qw( :full );  # For \N on older Perls
    use Encode qw( encode decode );
    use HTML::Entities qw( encode_entities );

    my $utf8_text = encode('UTF-8', "a\N{U+00E9}\N{U+2660} 1<4");
    my $utf8_html = encode('UTF-8', "a\N{U+00E9}\N{U+2660} <b>foo</b>");

Update: Small fixes to bugs found during testing.
