UTF-8 to Latin1 - unmatched characters?

by uncommon13 (Novice)
on Mar 20, 2008 at 16:36 UTC (#675248)
uncommon13 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have used the Perl module Encode to convert data from a database containing UTF-8 data to Latin1 when outputting to a file. Basically, the code is:
open (FILE, ">:encoding(iso-8859-1)", "$file");
It works fine, except that some characters such as single quotes, double quotes, dashes, and apostrophes are written out as escapes. For example:
“ becomes \x{201c}, and – becomes \x{2013}
The final Latin-1 output file is an XML file. Is there any way to convert these to the proper characters under Latin-1? Should numeric character entities be used, since it is an XML file? Were these escapes inserted because the characters have no Latin-1 equivalent in the conversion from UTF-8? Is there a module or subroutine that could convert these for me? Thanks.

Replies are listed 'Best First'.
Re: UTF-8 to Latin1 - unmatched characters?
by Joost (Canon) on Mar 20, 2008 at 16:44 UTC
    Converting to latin-1 only works if the characters used are actually in the latin-1 character set.

    LEFT DOUBLE QUOTATION MARK and EN DASH only look sort of like " and -, but they're not the same characters. And you should get a warning when trying to convert them.

    Using numeric entities should work, but I wonder why you're not just encoding the XML file as UTF-8.
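A minimal sketch of the numeric-entity route mentioned above, assuming the text is already held as decoded Perl character strings. Encode's FB_XMLCREF check value makes encode() emit an XML numeric character reference for anything Latin-1 cannot hold. The sample string and variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample text: curly quotes, an en dash, and one real Latin-1 character (e-acute).
my $text = "\x{201c}caf\x{e9}\x{201d} \x{2013} ok";

# Characters that exist in Latin-1 become single bytes; the rest
# become &#xHHHH; references instead of \x{HHHH} escapes.
my $latin1 = encode('iso-8859-1', $text, Encode::FB_XMLCREF());

print $latin1, "\n";
```

The result is a byte string, so it would be written through a handle opened without an :encoding layer (for example with '>:raw').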

Re: UTF-8 to Latin1 - unmatched characters?
by samtregar (Abbot) on Mar 20, 2008 at 16:48 UTC
      Thanks Sam. I used this; however, it also converts the other valid Latin-1 characters to ASCII.

      So, I found this which converts non-matched UTF-8 characters to something:

      So basically, the code would be something like:

      # Converted UTF codes for non-matching ISO-8859-1
      # Strip it down to basic ASCII
      %utf_entity = (
          "\x{2019}", "'",
          "\x{201c}", '"',
          "\x{201d}", '"',
          "\x{2026}", "...",
          "\x{fffd}", "",
      );
      s/(\X)/ exists $utf_entity{$1} ? $utf_entity{$1} : $1 /eg;
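A hedged variation on the lookup-table idea: instead of matching every \X cluster, this sketch only touches code points above U+00FF, so valid Latin-1 characters pass through unchanged, and anything missing from the table falls back to a numeric character reference. The %replacement table and the to_latin1_safe name are hypothetical, not taken from the thread.

```perl
use strict;
use warnings;

# Hypothetical replacement table: only characters that Latin-1 lacks.
my %replacement = (
    "\x{2018}" => "'",
    "\x{2019}" => "'",
    "\x{201c}" => '"',
    "\x{201d}" => '"',
    "\x{2013}" => '-',
    "\x{2026}" => '...',
);

sub to_latin1_safe {
    my ($text) = @_;
    # Match only code points above U+00FF; everything else is left alone.
    $text =~ s{([^\x{00}-\x{FF}])}{
        exists $replacement{$1} ? $replacement{$1} : sprintf('&#x%X;', ord $1)
    }ge;
    return $text;
}

print to_latin1_safe("\x{201c}na\x{ef}ve\x{201d} \x{2013} \x{20ac}5"), "\n";
```

After this pass every remaining character fits in Latin-1, so the :encoding(iso-8859-1) layer should have nothing left to escape.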
        I was going to recommend passing only characters that don't exist in iso-latin-1 to unidecode using a fallback handler to encode. It works, but I'm getting an error (Close with partial character.) when the file handle is closed, and I have no idea how to fix it.

        Here's the code anyway:

        use strict;
        use warnings;

        use PerlIO::encoding qw( );
        use Text::Unidecode  qw( unidecode );

        use constant FB_UNIDECODE => sub { unidecode(chr($_[0])) };

        my $file = '...';

        local $PerlIO::encoding::fallback = FB_UNIDECODE;
        open(my $fh, '>:encoding(iso-8859-1)', $file)
            or die("Unable to create file \"$file\": $!\n");
        print $fh "abc\x{201C}def\x{2013}ghi";
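This does not diagnose the "Close with partial character" error. It is only a sketch of one way to sidestep the PerlIO layer: call Encode::encode() on the string directly and pass the fallback as a code reference, which encode() documents as receiving the ordinal of each unmappable character. The %ascii_for table is a hypothetical stand-in for unidecode(), so the example runs without Text::Unidecode.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for unidecode(): code point => ASCII replacement.
my %ascii_for = (
    0x201C => '"',
    0x201D => '"',
    0x2013 => '-',
);

# Called by encode() with the ordinal of each character Latin-1 cannot hold.
my $fallback = sub {
    my ($ord) = @_;
    return exists $ascii_for{$ord} ? $ascii_for{$ord} : '?';
};

my $text  = "abc\x{201C}def\x{2013}ghi";
my $bytes = encode('iso-8859-1', $text, $fallback);

print $bytes, "\n";
```

The resulting bytes can then be printed to a handle opened with '>:raw', so no encoding layer is involved when the file is closed.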
Re: UTF-8 to Latin1 - unmatched characters?
by Juerd (Abbot) on Mar 20, 2008 at 16:46 UTC

    These characters do not exist in latin1. Perl is using an emergency fallback so you know what's going on.
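To illustrate where the \x{201c} text comes from: the PerlIO::encoding documentation describes the layer's default fallback as including Encode's PERLQQ behaviour. The sketch below applies the same behaviour by hand through encode() with FB_PERLQQ; the sample string is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "\x{201c}a\x{2013}b";

# U+201C and U+2013 have no Latin-1 code point, so FB_PERLQQ
# substitutes a Perl-style \x{HHHH} escape for each of them.
my $bytes = encode('iso-8859-1', $text, Encode::FB_PERLQQ());

print $bytes, "\n";
```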

Re: UTF-8 to Latin1 - unmatched characters?
by Anonymous Monk on Mar 21, 2008 at 14:00 UTC

    ISO Latin-1 does not contain these characters, but ANSI cp1252 does. This is an extension by Microsoft, beware of compatibility problems! Stick to ISO standards like 8859 or Unicode/UTF-8, if you can.

    The name is cp1252. The IANA name is windows-1252; this is what you use in the <?xml ...> PI or the HTTP Content-Type:...;charset=... header.
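A small sketch of the cp1252 option, assuming the consumers of the file can actually read windows-1252. Encode accepts 'cp1252' as an encoding name, and the curly quotes and en dash each map to a single byte there. FB_CROAK is used so that anything cp1252 also lacks raises an error instead of being silently escaped.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "\x{201c}quoted\x{201d} \x{2013} dash";

# Die on any character missing from cp1252; LEAVE_SRC keeps $text intact.
my $bytes = encode('cp1252', $text, Encode::FB_CROAK() | Encode::LEAVE_SRC());

# Show the encoded bytes in hex.
print join(' ', map { sprintf '%02X', ord } split //, $bytes), "\n";
```

In cp1252 the left and right double quotes are bytes 0x93 and 0x94, and the en dash is 0x96; in ISO-8859-1 those byte values are C1 control codes, which is the compatibility risk mentioned above.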
