PerlMonks  

Problem reading £ sign with XML::Simple

by gothic_mallard (Pilgrim)
on Apr 27, 2005 at 08:42 UTC (#451866)
gothic_mallard has asked for the wisdom of the Perl Monks concerning the following question:

Hi all
I'm having problems using XML::Simple to parse a data file containing the '£' symbol.

The data appears to read in correctly without error, but when I come to output it, the symbol appears as 'Â£'.

After trawling as much documentation as I could find, SuperSearching and googling, it looks like this is a problem with the encoding - XML::Parser transforms everything to UTF-8 internally, which can't handle '£' correctly. If I understand correctly, it performs a bit shift that changes it into the 2-byte char that gets output.

The problem I have is how to get the correct symbol back when I read the data back out.

I've tried playing with the various encoding options of XML::Simple (and I've tried XML::DOM also) but to no avail. Maybe I'm missing something obvious by staring at it too hard.

Any help would be greatly appreciated :)

Update

Thanks for all the helpful advice everyone! I think I now have a few possible solutions to the problem (I'll have to find out which works best under the system / network architecture that we have here), which is a lot more than the "none" that I had when I started :)

You've saved me a lot of torn out hair :)

--- Jay

All code is untested unless otherwise stated.
All opinions expressed are my own and are intended as guidance, not gospel; please treat what I say as such and as Abigail said Think for yourself.
If in doubt ask.

s++blah+;y(bl) .j.s;s+(h)+p$1+;???print:??;

Re: Problem reading £ sign with XML::Simple
by mirod (Canon) on Apr 27, 2005 at 09:20 UTC

    First, UTF-8 is perfectly capable of handling £; it just encodes it differently from your original encoding, which you did not specify, BTW.

    Did you search the XML::Simple docs for 'encoding'? The second set of hits gives you a possible solution:

    OutputFile => <file specifier> # out - handy

    The default behaviour of "XMLout()" is to return the XML as a string. If you wish to write the XML to a file, simply supply the filename using the 'OutputFile' option.

    This option also accepts an IO handle object - especially useful in Perl 5.8.0 and later for writing out in an encoding other than UTF-8, eg:

        open my $fh, '>:encoding(iso-8859-1)', $path or die "open($path): $!";
        XMLout($ref, OutputFile => $fh);

    Other alternatives include post-processing the output file with iconv or Text::Iconv, or using Encode if you are running perl 5.8.*

      I've tried a few different encodings in the original file with the <?xml version='1.0' encoding='blah'?> declaration: iso-8859-1, utf8 and utf16 (although the latter refused to parse the file).

      The data isn't being output by XML::Simple, rather it's coming out such as:

      use XML::Simple;
      my $x = XMLin('myxmlfile.xml');
      print "This costs: " . $x->{item}->{cost} . "\n";

      Where as an example the XML is:

      <?xml version='1.0' encoding='iso-8859-1'?>
      <catalogue>
        <item>
          <cost>£300</cost>
        </item>
      </catalogue>

      i.e. The values are being pulled out individually and inserted into a new file (which in the real case ends up producing a PDF, but the same behaviour occurs if I drop the values into a plain ASCII file or a HTML doc)

      Now I need to find a version of Text::Iconv that I can use on my system (ActivePerl 5.6.1 (Build 638) MSWin32-x86-multi-thread)

      --- Jay


        my system (ActivePerl 5.6.1...)
        Unicode handling is seriously hosed in perl versions before 5.8.0. You might want to upgrade.

        Dave.

        You have to understand what's going on here:

        The XML declaration tells the parser in which encoding the input document is encoded. In your case it's probably either ISO-8859-1 (or ISO-8859-15 if you use the € sign) or one of the windows encodings (I can't remember what the names are). So you need the proper declaration for the parser to be able to parse the data, and to make sense of it.

        Then XML::Simple (actually the parser underneath it) converts everything to UTF-8. That's the usual way, so your code (and the parser's) doesn't have to behave differently depending on the input encoding.

        Then you want to output the document in a given encoding, in your case probably the same as the input encoding. This is the step that you are missing.

        With 5.6.1 (which, as mentioned earlier, you should really update to 5.8.6) you have to use either Text::Iconv or Unicode::Map8 / Unicode::String. A SuperSearch on "character encoding conversion" or "utf8 iso-8859-1" or something like that should give you plenty of ways to do this.

        And of course XML::Twig will let you work with the same encoding as the input ;--)
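On perl 5.8 or later, the missing output step described above could be sketched with the core Encode module. This is only an illustration: the value is a hard-coded stand-in for what XMLin() would return, the helper name to_latin1 is made up, and ISO-8859-1 is assumed as the target encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for a value returned by XMLin(): a character string holding
# U+00A3 (POUND SIGN) followed by some digits. Assumption: the parser
# hands back character strings, as it does on perl 5.8+.
my $cost = "\x{A3}300";

# The output step: turn the internal characters into the bytes of the
# encoding the output is supposed to be in (hypothetical helper).
sub to_latin1 {
    my ($text) = @_;
    return encode('iso-8859-1', $text);
}

my $bytes = to_latin1($cost);

# Show the resulting bytes in hex rather than printing them raw, so the
# result does not depend on the terminal's own encoding.
printf "%d bytes: %s\n", length($bytes),
    join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
```

An alternative with the same effect for plain print statements is to set the encoding once on the handle, for example binmode(STDOUT, ':encoding(iso-8859-1)'), and then print the character strings unchanged.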

        Unicode is a character set whose code points run up to U+10FFFF, so it is not limited to 16 bits. UTF-16 is an encoding where most characters in the input stream are represented with two bytes, just as normal integers are represented (characters outside the Basic Multilingual Plane take four bytes, as a surrogate pair). The problem with this is that it makes all of the legacy C code (especially present in *NIX systems) choke and die horribly under most circumstances, as the encoding normally involves lots of null bytes which the standard libraries can't handle. UTF-8 is a kludge to prevent these problems. Basically what it does is map each code point to a sequence of 1 to 4 bytes, none of which are ever null (unless the char itself is null), along with a couple of other interesting properties: the seven-bit ASCII set is valid UTF-8, and the encoding of one character never appears inside the encoding of another, because lead bytes and continuation bytes use different bit patterns (this is useful at times).

        Anyway, the point is that it's pretty unlikely that you are going to work with UTF-16 encoding very often, although you might find yourself doing so on Win32 architecture, as internally Windows uses widechars for everything, iirc.

        NOTE: caveat emptor, this is as I remember things working from when I last dealt with Unicode in detail. I can't promise I've got the details exactly right.
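The byte-level behaviour described in this post can be checked with a short script. This is a rough sketch for perl 5.8 or later using the core Encode module; the helper hex_bytes and the choice of sample code points are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex pairs (hypothetical helper).
sub hex_bytes { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

# Sample code points: ASCII 'A', the pound sign, the euro sign, and one
# character outside the Basic Multilingual Plane (U+1D11E).
for my $cp (0x41, 0xA3, 0x20AC, 0x1D11E) {
    my $char = chr $cp;
    printf "U+%04X  UTF-8: %-12s UTF-16BE: %s\n", $cp,
        hex_bytes(encode('UTF-8',    $char)),
        hex_bytes(encode('UTF-16BE', $char));
}
```

The 'A' row shows the null-byte problem: UTF-16BE produces 00 41, while UTF-8 keeps the single ASCII byte 41.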

        ---
        demerphq

        Please note that changing the declaration at the top of the file does not magically change the encoding of the file. The encoding attribute serves only as a hint to the parser that the file is in a specific encoding.

        If you really want to recode your file, use any of the several methods already mentioned, or try the gnu recode utility (under cygwin, since you are on win32).
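For readers on perl 5.8 or later, recoding can also be done in Perl itself with PerlIO encoding layers. The following is a minimal sketch, not a replacement for the tools mentioned above; the sub name recode_file and its argument order are invented for this example.

```perl
use strict;
use warnings;

# Copy a text file, decoding it from one encoding and re-encoding it in
# another. Name and argument order are made up for this sketch.
sub recode_file {
    my ($in_path, $from, $out_path, $to) = @_;

    # The :encoding() layers do the conversion as the data passes through.
    open my $in,  "<:encoding($from)", $in_path  or die "open($in_path): $!";
    open my $out, ">:encoding($to)",   $out_path or die "open($out_path): $!";

    while (my $line = <$in>) {
        print {$out} $line;
    }

    close $in;
    close $out or die "close($out_path): $!";
    return;
}
```

A call such as recode_file('myxmlfile.xml', 'iso-8859-1', 'recoded.xml', 'UTF-8') would write a UTF-8 copy (the output filename is hypothetical). The XML declaration inside the copy is not touched, so it would still need to be edited to match the new encoding.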

        The character in question is character 163 (A3 hex). Its byte representation in iso-8859-1 and windows-1252 is A3. However, the UTF-8 encoding of this character is C2 A3. So the character string is being stored (possibly correctly) in UTF-8. However, when you print it, the bytes are displayed as if they were normal windows-1252 text, which is why you see the two characters you do.
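The A3 versus C2 A3 difference, and the two-character result of misreading it, can be reproduced with Encode on perl 5.8 or later. This is an illustrative sketch; hex_bytes is a made-up helper, and 'cp1252' is Encode's name for windows-1252.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Render a byte string as space-separated hex pairs (hypothetical helper).
sub hex_bytes { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

my $pound = "\x{A3}";                        # the character, code point 163

my $latin1 = encode('iso-8859-1', $pound);   # one byte
my $utf8   = encode('UTF-8',      $pound);   # two bytes

# Misreading the UTF-8 bytes as windows-1252 yields two characters:
# U+00C2 (A with circumflex) followed by U+00A3 (pound sign).
my $misread = decode('cp1252', $utf8);

print "iso-8859-1: ", hex_bytes($latin1), "\n";
print "UTF-8:      ", hex_bytes($utf8),   "\n";
printf "misread as cp1252: %d characters (%s)\n", length($misread),
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $misread;
```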

Re: Problem reading £ sign with XML::Simple
by pelagic (Curate) on Apr 27, 2005 at 09:46 UTC
    Have you got an XML declaration line in your XML document like:
    <?xml version="1.0" encoding="ISO-8859-1"?>
    where you could define the appropriate encoding for £ or what not?


    Update
    I was obviously late :p

    pelagic

      Actually I've tried it a few ways:

      • Explicitly including in the source document
      • Manually appending it as the data is read
      • Using the options in XML::Parser to force an encoding

      The file did initially come without declarations. The structure itself is a single file with header and footer lines, where each line in between is a distinct XML document (newline delimited, obviously).
      The program reads through one line at a time and feeds these lines individually to XML::Simple, where it parses them and outputs the relevant values before discarding the data and reading the next. e.g.

      HEADER0002
      <?xml .. ?><document><colour>red</colour><cost>10</cost></document>
      <?xml .. ?><document><colour>blue</colour><cost>14</cost></document>
      ....
      .... etc
      FOOTER0002
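The line-at-a-time processing described above might be structured roughly as follows. This is a sketch under assumptions: the sub name for_each_document and its callback interface are invented, header and footer lines are assumed to start with HEADER or FOOTER as in the sample, and the demonstration reads from an in-memory filehandle (perl 5.8 or later) rather than the real data file.

```perl
use strict;
use warnings;

# Call $handler once for every line that is neither a header nor a footer,
# and return the number of lines handled (hypothetical interface).
sub for_each_document {
    my ($fh, $handler) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^(?:HEADER|FOOTER)/;   # skip the wrapper lines
        next unless $line =~ /\S/;               # skip blank lines
        $handler->($line);
        $count++;
    }
    return $count;
}

# Demonstration on an in-memory file holding the two sample records.
# The elided declaration is kept exactly as it appears in the post.
my $sample = join "\n",
    'HEADER0002',
    '<?xml .. ?><document><colour>red</colour><cost>10</cost></document>',
    '<?xml .. ?><document><colour>blue</colour><cost>14</cost></document>',
    'FOOTER0002',
    '';

open my $fh, '<', \$sample or die "open(in-memory): $!";
my $handled = for_each_document($fh, sub { print "record: $_[0]\n" });
close $fh;
print "$handled records\n";
```

In the real program the callback would pass each line to XMLin(), which accepts a string of XML as well as a filename, and any encoding conversion of the extracted values would happen inside that callback before they are written out.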

      --- Jay


Re: Problem reading £ sign with XML::Simple
by wazoox (Prior) on Apr 27, 2005 at 10:34 UTC
