Converting character encodings

by mirod (Canon)
on Jun 01, 2001 at 23:03 UTC (#85029, CUFP)

This is partly a cool use of Perl and partly a question.

It shows how to convert data from UTF-8 to Latin-1 (and would be very easy to adapt to other encodings). This really matters when using XML::Parser (and in fact nearly all Perl XML modules), because it returns UTF-8 no matter what the encoding of the original file is.

It gives you the choice of 3 methods:

  • a regexp (lifted from XML::TiePYX), which obviously works only for conversion to Latin-1,
  • the Unicode::String (and Unicode::Map8) modules (lifted from somewhere here or on the perl-xml mailing list, I can't remember),
  • the Text::Iconv module (which needs the iconv library to be available on your machine), which I actually managed to figure out how to use myself, straight from the docs ;--)

Now here is my problem: using Perl 5.6.1, the regexp solution works fine with XML::Parser 2.27 but not with version 2.30 (the tag and attribute names are not converted). I have had various problems with encoding conversion recently, be it with XML::TiePYX or XML::Parser, and as I am including such filters in XML::Twig I am wondering if anybody has any idea what is going on. Could you also test this script with various combinations of OS and, most importantly, of Perl and XML::Parser versions, just to get an idea of the magnitude of the problem?

Oh, and if anybody has any idea of how to solve this problem that would be very cool of course! Plus I'll take any advice on how to improve this code.

The way I create the filter function with Unicode::String and Text::Iconv is a little convoluted, but I needed to do it this way in XML::Twig, so I thought I'd leave it as-is just to show how you can pass an extra function reference to XML::Parser::Expat. It would be very easy to simplify and just call a regular subroutine instead.
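For comparison, here is a minimal sketch of the simpler shape: a factory that returns an ordinary closure, with the converter created lazily on first use and no string eval. It assumes a Perl where the Encode module is available (core from 5.8 on) and swaps it in for Unicode::Map8 and Text::Iconv only to keep the sketch free of non-core modules; `make_filter` is a hypothetical name, not something from XML::Twig.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns a filter that turns UTF-8 bytes into $enc bytes.
sub make_filter {
    my ($enc) = @_;
    my $cnv;    # created lazily, on the first call, then reused
    return sub {
        $cnv ||= Encode::find_encoding($enc)
            or die "Can't create converter for '$enc'";
        # decode the UTF-8 bytes to characters, then encode to the target
        return $cnv->encode( Encode::decode( 'UTF-8', $_[0] ) );
    };
}

my $filter = make_filter('iso-8859-1');
my $latin1 = $filter->("caf\xc3\xa9");    # UTF-8 bytes for "café"
print length($latin1), " bytes after conversion\n";
```

The returned code reference can be passed around exactly like the ones built with `eval q{...}` in the script, for example as the `filter` option.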

    #!/bin/perl -w

    # converts XML data from UTF-8 back into latin1
    #   -r uses a regexp
    #   -u uses Unicode::String
    #   -i uses Text::Iconv (and the iconv library)
    # Note: -r does not work properly with XML::Parser 2.30

    use strict;
    use XML::Parser;

    print "perl $] - XML::Parser $XML::Parser::VERSION\n";

    my $mode = defined $ARGV[0] ? $ARGV[0] : '';
    my $filter;
    if    ( $mode eq '-r' ) { $filter = \&latin1; }
    elsif ( $mode eq '-u' ) { $filter = unicode_convert('latin1'); }
    elsif ( $mode eq '-i' ) { $filter = iconv_convert('latin1'); }
    else                    { die "usage: $0 [-r|-u|-i]"; }

    # I like to escape as few characters as possible
    # but you might need to escape ' too (with &apos;)
    my %ent = ( '"' => '&quot;', '<' => '&lt;', '&' => '&amp;' );

    my $p = new XML::Parser(
        Handlers => {
            Start   => \&start,
            End     => \&end,
            Default => \&default,
        },
        filter => $filter,    # extra option, passed on to XML::Parser::Expat
    );
    $p->parse( \*DATA );
    print "\n";

    sub start {
        my ( $p, $tag, %att ) = @_;
        print '<', $p->{filter}->($tag);
        while ( my ( $att, $val ) = each %att ) {
            $val =~ s/(["<&])/$ent{$1}/g;    # re-escape (%ent was never used before)
            print ' ', $p->{filter}->($att), '="', $p->{filter}->($val), '"';
        }
        print '>';
    }

    sub end {
        my ( $p, $tag ) = @_;
        print '</', $p->{filter}->($tag), '>';
    }

    sub default {
        my $p = shift;    # use the Expat object passed in, not the file-level $p
        print $p->{filter}->( $p->recognized_string() );
    }

    # shamelessly lifted from XML::TiePYX
    sub latin1 {
        my $text = shift;
        $text =~ s{([\xc0-\xc3])(.)}{
            my $hi = ord($1);
            my $lo = ord($2);
            chr((($hi & 0x03) << 6) | ($lo & 0x3F))
        }ge;
        return $text;
    }

    sub unicode_convert {
        my $enc = shift;
        require Unicode::Map8;
        require Unicode::String;
        import Unicode::String qw(utf8);
        my $sub = eval q{
            {
                my $cnv;
                sub {
                    $cnv ||= new Unicode::Map8 ($enc)
                        or die "Can't create converter";
                    return $cnv->to8 (utf8($_[0])->ucs2);
                }
            }
        };
        return $sub;
    }

    sub iconv_convert {
        my $enc = shift;
        require Text::Iconv;
        my $sub = eval q{
            {
                my $cnv;
                sub {
                    $cnv ||= new Text::Iconv( 'utf8', $enc)
                        or die "Can't create converter";
                    return $cnv->convert( $_[0]);
                }
            }
        };
        return $sub;
    }

    __DATA__
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <docé té="val'ué">Un homme soupçonné d'être impliqué dans la mort d'un motard de la police,
    renversé</docé>
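To make the regexp in `latin1` less magic, here is the same bit arithmetic applied by hand to a single character. It is only a worked example and assumes the input is the two-byte UTF-8 sequence for é (U+00E9). Because the lead-byte mask keeps just two bits, the result can never exceed 0xFF, which is why this trick only works for Latin-1.

```perl
use strict;
use warnings;

# UTF-8 encodes U+00E9 (é) as the two bytes 0xC3 0xA9:
#   0xC3 = 110 00011   lead byte, payload bits 00011
#   0xA9 = 10 101001   continuation byte, payload bits 101001
my ( $hi, $lo ) = ( 0xC3, 0xA9 );

# Same arithmetic as in latin1(): keep the low 2 bits of the lead byte,
# shift them above the low 6 bits of the continuation byte.
my $code = ( ( $hi & 0x03 ) << 6 ) | ( $lo & 0x3F );

printf "0x%02X\n", $code;    # prints 0xE9, the Latin-1 code for é
```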

Re: Converting character encodings
by John M. Dlugosz (Monsignor) on Jun 02, 2001 at 01:05 UTC
    On a Windows box, I can just make a system call (via Win32::API) to do the conversion between encodings.
What are you expecting XML to be in?
by John M. Dlugosz (Monsignor) on Jun 03, 2001 at 21:46 UTC
    I read that XML was always in Unicode. Specifically, encoding was always UTF-8 or UTF-16. Has this been changed since that book was printed, or do people just do it anyway since the attribute is there?

    In any case, the problem of converting from UTF-8 (internal to the script) to whatever encoding the caller wants is rather general.

        That's a proper subset of UTF-8, so not really necessary. Can a particular XML file be represented in, say, 8859-6 or JIS-X, and still be standard? I don't like this because it means that a file can't be read unless the parser knows that character set.

      Actually XML uses UTF-8 or UTF-16 by default (and has ways to figure out which one is used), but allows any encoding, as long as it is specified in the XML declaration (as <?xml version="1.0" encoding="whatever"?>). The parser then has to deal with the encoding.

      It is an implementation choice in expat (and then in XML::Parser) that all strings are passed to the handlers in UTF-8, but I don't think the XML spec mandates this choice.

      And because the environment in which the XML is used often does not support UTF-8, but rather Latin-1 or Shift-JIS or whatever, it is often very important (and painful!) to convert all strings back to their original encoding.
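As a rough sketch of how a script could find out what that original encoding was, the helper below reads it from the XML declaration. `declared_encoding` is a hypothetical name, and the sketch assumes the start of the document is already in a string; it does not look at byte-order marks, so UTF-16 input is not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the encoding named in the XML declaration,
# or 'UTF-8' when there is no declaration or no encoding pseudo-attribute.
sub declared_encoding {
    my ($xml) = @_;
    if ( $xml =~ /\A<\?xml\s+version\s*=\s*(?:"[^"]*"|'[^']*')\s+encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/ ) {
        return $1;
    }
    return 'UTF-8';
}

print declared_encoding(qq{<?xml version="1.0" encoding="ISO-8859-1"?><doc/>}), "\n";
print declared_encoding(qq{<doc/>}), "\n";
```

The returned name could then be used as the target encoding for whichever conversion method is in use, provided that method recognizes the name.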

Re: Converting character encodings
by Anonymous Monk on Jun 04, 2001 at 00:51 UTC
    With development versions of Perl you can now use the Encode module like so:
    use Encode qw(encode decode);
    my $iso_data = encode('iso-8859-1', decode('UTF-8', $utf8_data));
    The list of encodings that Encode currently supports is given by Encode::encodings():
    koi8-r dingbats iso-8859-10 iso-8859-13 cp37 iso-8859-9 iso-8859-6 iso-8859-1 cp1047 iso-8859-4 Internal iso-8859-2 symbol iso-8859-3 US-ascii iso-8859-8 iso-8859-14 UCS-2 iso-8859-5 UTF-8 iso-8859-7 iso-8859-15 cp1250 iso-8859-16 posix-bc
    That list is expandable via input text files found in the ext/Encode/Encode directory in the Perl source tarball distribution.
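One thing the one-liner above leaves open is what happens to characters that have no Latin-1 equivalent. A hedged sketch, assuming the Encode module as shipped with released Perls (5.8 and later), which may differ from the development version described here: passing `FB_CROAK` as the check argument makes the conversion die instead of silently substituting.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $utf8_data = "soup\xc3\xa7onn\xc3\xa9";    # UTF-8 bytes for "soupçonné"

# FB_CROAK makes both steps die on bad input instead of substituting;
# LEAVE_SRC keeps the source string untouched.
my $check    = Encode::FB_CROAK() | Encode::LEAVE_SRC();
my $chars    = decode( 'UTF-8', $utf8_data, $check );
my $iso_data = encode( 'iso-8859-1', $chars, $check );

print length($utf8_data), " bytes in, ", length($iso_data), " bytes out\n";

# A character with no Latin-1 equivalent (here U+20AC, the euro sign)
# makes encode() die, which the eval catches.
my $ok = eval { encode( 'iso-8859-1', "\x{20ac}", $check ); 1 };
print "euro sign: ", ( $ok ? "converted" : "not representable in iso-8859-1" ), "\n";
```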
