PerlMonks  

Need some encoding help

by jalewis2 (Monk)
on Dec 02, 2010 at 22:26 UTC ( [id://875027]=perlquestion )

jalewis2 has asked for the wisdom of the Perl Monks concerning the following question:

I've probably spent too much time working on this and need some outside eyeballs to get me back on track.

I have some gzipped XML files. I loop over them and open each one with a filehandle that pipes it through gunzip. I can pass the filehandle to XML::Simple and process it with no problem, but an encoding issue has popped up.

The files are supposed to be UTF-8, but they are actually ISO-8859-1. I figured I could convert them as I processed each file, but the gzip step is tripping me up.

Any suggestions on how to gunzip and decode on the fly?

Replies are listed 'Best First'.
Re: Need some encoding help
by roboticus (Chancellor) on Dec 02, 2010 at 23:53 UTC

    jalewis2:

    XML::Simple will also take a string instead of a filehandle, so you could slurp in the file, do a quick patch of the XML header to correct the encoding specification and then pass the text to XML::Simple.
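The suggestion above could be sketched roughly as follows. This is a minimal sketch, not a tested recipe from the thread: the helper names `slurp_gzip`, `declare_latin1` and `parse_gzipped_xml` are hypothetical, the core module IO::Uncompress::Gunzip is swapped in for the gunzip pipe, and the declaration text is the one quoted further down the thread. It also assumes the underlying parser accepts ISO-8859-1 input.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical helper: inflate a gzipped file (or a reference to a
# scalar holding gzipped bytes) into a string of raw octets.
sub slurp_gzip {
    my ($source) = @_;
    my $xml;
    gunzip($source => \$xml)
        or die "gunzip failed: $GunzipError\n";
    return $xml;
}

# Hypothetical helper: make the XML declaration claim ISO-8859-1.
sub declare_latin1 {
    my ($xml) = @_;
    my $decl = q{<?xml version='1.0' encoding='ISO-8859-1' ?>};
    # Replace an existing declaration, or prepend one if there is none.
    $xml =~ s/\A<\?xml\b[^>]*\?>/$decl/
        or $xml = "$decl\n$xml";
    return $xml;
}

# Hypothetical helper: gunzip, patch the header, then parse the string.
sub parse_gzipped_xml {
    my ($source) = @_;
    require XML::Simple;    # loaded lazily; not a core module
    return XML::Simple->new->XMLin(declare_latin1(slurp_gzip($source)));
}
```

Calling `parse_gzipped_xml('data.xml.gz')` (the file name is only an example) would then return the usual XML::Simple data structure. Keeping the header patch in its own sub makes it easy to check without any XML parser installed.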

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      The issue is that the XML claims to be UTF-8, but it contains ISO-8859-1 characters. I've successfully converted it from the command line, but when I try to put the gzip, conversion and processing together, it breaks.

      It didn't dawn on me that changing the xml header might tell XML::Simple to read the file differently. Is that what you're suggesting?

        jalewis2:

        Exactly. I'm thinking that you may be able to update the XML header to add the proper encoding attribute, something like <?xml version='1.0' encoding='ISO-8859-1' ?>, as described in Section 4.3, "Parsed Entities", of the XML specification. I don't know if this trick will work, as I've never used it. But hopefully it'll get you past your hurdle. I'm not very knowledgeable about XML, so

        Another link I use when trying to debug XML problems is the W3C XML Markup Validation Service. That's where I point people when they tell me I'm wrong about their file being incorrect.

        ...roboticus

        When your only tool is a -hammer- Java, all problems look like your -thumb- XML.

Re: Need some encoding help
by zentara (Archbishop) on Dec 03, 2010 at 18:25 UTC
    Encode::Detect::Detector may be of help.

    You could open each file, run detection on a sample of the octet stream, then convert the non-UTF-8 ones to UTF-8.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Encode qw(encode);
    use Encode::Detect::Detector;

    # detect() works on raw octets, so the wide-character strings
    # are encoded to UTF-8 bytes before being passed in.
    my $octets = encode('UTF-8', "\x{4f60}\x{597d}\x{4e16}\x{754c}");
    my $charset = Encode::Detect::Detector::detect($octets);
    print "$charset\n";

    # Raw bytes of Shift_JIS-encoded text.
    $octets = "\x82\xb7\x82\xb2\x82\xa2\x82\xcc\x82\xdd\x82\xc2";
    $charset = Encode::Detect::Detector::detect($octets);
    print "$charset\n";

    $octets = encode('UTF-8', "\x{805a}\x{5408}\x{6216}\x{8be6}\x{7ec6}");
    $charset = Encode::Detect::Detector::detect($octets);
    print "$charset\n";
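If installing a detector is not an option, a narrower check may be enough for the case in this thread. The sketch below swaps statistical detection for a strict UTF-8 validity test using only the core Encode module, and it assumes, as the question states, that anything that is not valid UTF-8 is ISO-8859-1. The helper name `to_utf8_octets` is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: return the input as UTF-8 octets.
sub to_utf8_octets {
    my ($octets) = @_;
    # Strict decode: dies on any byte sequence that is not well-formed
    # UTF-8. LEAVE_SRC keeps decode() from modifying $octets in place.
    my $is_utf8 = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC); 1 };
    return $octets if $is_utf8;
    # Assumption taken from the thread: everything else is ISO-8859-1.
    return encode('UTF-8', decode('ISO-8859-1', $octets));
}
```

After this conversion a UTF-8 declaration in the XML header is actually true, so the string can be handed to the parser unchanged. The check is not a real detector: ISO-8859-1 text whose bytes happen to form valid UTF-8 sequences would pass through unconverted.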

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
