Converting character encodings

This is partly a cool use of Perl and partly a question.

It shows how to convert data from UTF-8 to latin 1 (and would be very easy to adapt to other encodings), which is really important when using XML::Parser (and in fact nearly all Perl XML modules) as it returns UTF-8 no matter what the encoding of the initial file is.

It gives you the choice of 3 methods:

a regexp (lifted from XML::TiePYX), which obviously works only for conversion to latin1,
using the Unicode::String (and Unicode::Map8) modules (lifted from somewhere here or on the perl-xml mailing list, I can't remember),
using the Text::Iconv module (which needs the iconv library to be available on your machine), which I actually managed to figure out how to use myself, straight from the docs ;--)

Now here is my problem: using Perl 5.6.1, the regexp solution works fine with XML::Parser 2.27 but not with version 2.30 (the tag and attribute names are not converted). I have had various problems with converting encodings recently, be it with XML::TiePYX or XML::Parser, and as I am including such filters in XML::Twig I am wondering if anybody has any idea what is going on. Could you test this script with various combinations of OS, but most importantly of Perl and XML::Parser versions, just to get an idea of the magnitude of the problem?

Oh, and if anybody has any idea of how to solve this problem that would be very cool of course! Plus I'll take any advice on how to improve this code.

The way I create the filter function with Unicode::String and Text::Iconv is a little convoluted, but I needed to do it this way in XML::Twig so I thought I'd leave it as-is just to show how you can pass an extra function reference to XML::Parser::Expat. It would be very easy to simplify and just call a regular subroutine instead.
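
For comparison, here is a minimal sketch of the same filter-factory idea written as an ordinary closure, with no string eval. It assumes a Perl that ships the core Encode module (5.8 or later), which is not the 5.6.1 setup described above, and make_filter is a hypothetical name, not part of XML::Parser or XML::Twig.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# make_filter is a hypothetical helper: it returns a code reference that
# converts a UTF-8 byte string into the requested encoding.
sub make_filter {
    my $enc = shift;
    my $cnv;    # encoding object, created on first use and kept by the closure
    return sub {
        my $bytes = shift;
        $cnv ||= Encode::find_encoding($enc)
            or die "Can't create converter for $enc";
        my $chars = Encode::decode( 'UTF-8', $bytes );   # bytes -> characters
        return $cnv->encode($chars);                     # characters -> bytes
    };
}

my $filter = make_filter('iso-8859-1');
my $latin1 = $filter->("caf\xc3\xa9");    # UTF-8 bytes for "café"
printf "%vX\n", $latin1;                  # hex value of each resulting byte
```

The returned code reference can be stored and called exactly like the $filter in the script below; the converter object lives in the closure, so it is created only once.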

#!/bin/perl -w
# converts XML data from UTF-8 back into latin1
# -r uses a regexp
# -u uses Unicode::String (and Unicode::Map8)
# -i uses Text::Iconv (and the iconv library)

# Note: -r does not work properly with XML::Parser 2.30

use strict;
use XML::Parser;

print "perl $] - XML::Parser $XML::Parser::VERSION\n";

my $filter;

my $opt = defined $ARGV[0] ? $ARGV[0] : '';   # avoids a warning when no option is given

if(    $opt eq '-r') { $filter = \&latin1;                  }
elsif( $opt eq '-u') { $filter= unicode_convert( 'latin1'); }
elsif( $opt eq '-i') { $filter= iconv_convert( 'latin1');   }
else { die "usage: $0 [-r|-u|-i]\n"; }

# I like to escape as few characters as possible
# but you might need to escape ' too (with &apos;)
my %ent=( '"' => '&quot;', '<' => '&lt;', '&' => '&amp;');

my $p = new XML::Parser( Handlers => { Start =>   \&start,
                                       End   =>   \&end,
                                       Default => \&default,
                          },
                  filter     => $filter,
               );
$p->parse( \*DATA);
print "\n";

sub start
  { my( $p, $tag, %att)= @_;
    print '<', $p->{filter}->( $tag);
    while( my( $att, $val)= each %att)
      { print ' ', $p->{filter}->( $att), '="', $p->{filter}->( $val), '"'; }
    print '>';
  }

sub end
  { my( $p, $tag)= @_;
    print '</', $p->{filter}->( $tag), '>';
  } 

sub default
  { print $p->{filter}->( $_[0]->recognized_string()); }

# shamelessly lifted from XML::TiePYX
sub latin1 
  { my $text=shift;
    $text=~s{([\xc0-\xc3])(.)}{ my $hi = ord($1);
                                my $lo = ord($2);
                                chr((($hi & 0x03) <<6) | ($lo & 0x3F))
                              }ge;
    return $text;
  }

sub unicode_convert
  { my $enc= shift;
    require Unicode::Map8;
    require Unicode::String;
    import Unicode::String qw(utf8);
    my $sub= eval q{
            { my $cnv;
          sub { $cnv ||= new Unicode::Map8 ($enc) 
                  or die "Can't create converter";
            return  $cnv->to8 (utf8($_[0])->ucs2); 
              } 
        } };
    return $sub;
  }

sub iconv_convert
  { my $enc= shift;
    require Text::Iconv;
    my $sub= eval q{
            { my $cnv;
          sub { $cnv ||= new Text::Iconv( 'utf8', $enc) 
                  or die "Can't create converter";
            return  $cnv->convert( $_[0]); 
              } 
        } };
    return $sub;
  }

__DATA__

<?xml version="1.0" encoding="ISO-8859-1"?>
<docé té="val'ué">Un homme soupçonné d'être impliqué dans la mort 
     d'un motard de la police, renversé</docé>
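
To see why the regexp in latin1 does the job, it can help to trace the bit arithmetic on a single character. The sketch below uses only core Perl and variable names chosen for illustration: é is U+00E9, which UTF-8 encodes as the two bytes 0xC3 0xA9. The lead byte contributes its low 2 bits and the second byte its low 6 bits.

```perl
use strict;
use warnings;

my ( $hi, $lo ) = ( 0xC3, 0xA9 );    # UTF-8 bytes for U+00E9 (e acute)

my $top    = ( $hi & 0x03 ) << 6;    # low 2 bits of the lead byte, shifted: 0xC0
my $bottom = $lo & 0x3F;             # low 6 bits of the second byte: 0x29
my $code   = $top | $bottom;         # 0xC0 | 0x29 = 0xE9

printf "0x%02X\n", $code;            # prints 0xE9
```

A lead byte of 0xC3 with the largest possible second byte gives 0xFF, so the lead bytes matched by [\xc0-\xc3] cover exactly the code points that fit in a single latin1 byte. Anything encoded with a higher lead byte is left untouched, which is why this method cannot be adapted to other target encodings.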

Replies are listed 'Best First'.
Re: Converting character encodings by Anonymous Monk on Jun 04, 2001 at 00:51 UTC
With development versions of Perl you can now use the Encode module like so: `use Encode qw(encode decode); my $iso_data = encode('iso-8859-1', decode('UTF-8', $utf8_data));` The list of encodings that Encode currently supports is given by Encode::encodings(): `koi8-r dingbats iso-8859-10 iso-8859-13 cp37 iso-8859-9 iso-8859-6 iso-8859-1 cp1047 iso-8859-4 Internal iso-8859-2 symbol iso-8859-3 US-ascii iso-8859-8 iso-8859-14 UCS-2 iso-8859-5 UTF-8 iso-8859-7 iso-8859-15 cp1250 iso-8859-16 posix-bc` That list is expandable via the input text files found in the ext/Encode/Encode directory of the Perl source tarball distribution.
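
A slightly fuller sketch of the Encode approach, assuming a Perl where Encode is a core module (5.8 or later). Here to_latin1 is a hypothetical helper, and the Encode::FB_CROAK check makes the conversion die instead of silently substituting characters that ISO-8859-1 cannot represent.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# to_latin1 is a hypothetical helper, not part of Encode itself.
sub to_latin1 {
    my $utf8_data = shift;    # work on a copy: Encode may modify its input when a check is set
    my $chars = decode( 'UTF-8', $utf8_data, Encode::FB_CROAK() );
    return encode( 'iso-8859-1', $chars, Encode::FB_CROAK() );
}

my $ok = to_latin1("soup\xc3\xa7onn\xc3\xa9");    # UTF-8 bytes for "soupçonné"
print length($ok), " bytes\n";

# U+20AC (euro sign) has no ISO-8859-1 equivalent, so this call dies
# and eval returns undef.
my $bad = eval { to_latin1("\xe2\x82\xac") };
print "conversion failed: $@" unless defined $bad;
```

Whether to die, substitute a placeholder, or emit a character reference for unmappable characters is a design choice; Encode offers other fallback constants for the latter cases.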
What are you expecting XML to be in? by John M. Dlugosz (Monsignor) on Jun 03, 2001 at 21:46 UTC
I read that XML was always in Unicode. Specifically, `encoding` was always UTF-8 or UTF-16. Has this been changed since that book was printed, or do people just do it anyway since the attribute is there? IAC, the problem of converting from UTF-8 (internal to the script) to whatever encoding the caller wants is rather general.
Re: What are you expecting XML to be in? by merlyn (Sage) on Jun 03, 2001 at 21:56 UTC
An XML compliant parser must support Unicode, but a particular Unicode file can be represented in ISO-8859-1. -- Randal L. Schwartz, Perl hacker
Re: Re: What are you expecting XML to be in? by John M. Dlugosz (Monsignor) on Jun 03, 2001 at 22:44 UTC
That's a proper subset of UTF-8, so not really necessary. Can a particular XML file be represented in, say, 8859-6 or JIS-X, and still be standard? I don't like this because it means that a file can't be read unless the parser knows that character set.
Re: Re: Re: What are you expecting XML to be in? by merlyn (Sage) on Jun 04, 2001 at 02:19 UTC
Re: Re: Re: Re: What are you expecting XML to be in? by John M. Dlugosz (Monsignor) on Jun 04, 2001 at 07:23 UTC
Re: What are you expecting XML to be in? by mirod (Canon) on Jun 03, 2001 at 22:29 UTC
Actually XML uses UTF-8 or UTF-16 by default (and has ways to figure out which one is used), but allows any encoding, as long as it is specified in the XML declaration (as `<?xml version="1.0" encoding="whatever"?>`). The parser then has to deal with the encoding. It is an implementation choice in expat (and then in XML::Parser) that all strings are passed to the handlers in UTF-8, but I don't think the XML spec mandates this choice. And because the environment in which the XML is used often does not support UTF-8, but rather latin 1 or shift-JIS or whatever, it is often very important (and painful!) to convert all strings back to their original encoding.
Re: Converting character encodings by John M. Dlugosz (Monsignor) on Jun 02, 2001 at 01:05 UTC
On a Windows box, I can just make a system call (via Win32::API) to do the conversion between encodings.
