PerlMonks  

Re^2: Problem reading £ sign with XML::Simple

by gothic_mallard (Pilgrim)
on Apr 27, 2005 at 09:45 UTC


in reply to Re: Problem reading £ sign with XML::Simple
in thread Problem reading £ sign with XML::Simple

I've tried a few different encodings in the original file with the <?xml version='1.0' encoding='blah'?> declaration: iso-8859-1, utf8 and utf16 (although the latter refused to parse the file).

The data isn't being written out by XML::Simple itself; rather, I pull the values out and print them, like this:

use XML::Simple;
my $x = XMLin('myxmlfile.xml');
print "This costs: " . $x->{item}->{cost} . "\n";

Where as an example the XML is:

<?xml version='1.0' encoding='iso-8859-1'?>
<catalogue>
  <item>
    <cost>£300</cost>
  </item>
</catalogue>

That is, the values are being pulled out individually and inserted into a new file (which in the real case ends up producing a PDF, but the same behaviour occurs if I drop the values into a plain ASCII file or an HTML doc).

Now I need to find a version of Text::Iconv that I can use on my system (ActivePerl 5.6.1 (Build 638) MSWin32-x86-multi-thread)

--- Jay

All code is untested unless otherwise stated.
All opinions expressed are my own and are intended as guidance, not gospel; please treat what I say as such and as Abigail said Think for yourself.
If in doubt ask.

s++blah+;y(bl) .j.s;s+(h)+p$1+;???print:??;

Replies are listed 'Best First'.
Re^3: Problem reading £ sign with XML::Simple
by mirod (Canon) on Apr 27, 2005 at 10:40 UTC

    You have to understand what's going on here:

    The XML declaration tells the parser in which encoding the input document is encoded. In your case it's probably either ISO-8859-1 (or ISO-8859-15 if you use the € sign) or one of the windows encodings (I can't remember what the names are). So you need the proper declaration for the parser to be able to parse the data, and to make sense of it.

    Then XML::Simple (actually the parser underneath it) converts everything to UTF-8. That's the usual way, so your code (and the parser's) doesn't have to behave differently depending on the input encoding.

    Then you want to output the document in a given encoding, in your case probably the same as the input encoding. This is the step that you are missing.

    With 5.6.1 (which, as mentioned earlier, you should really update to 5.8.6) you have to use either Text::Iconv or Unicode::Map8 / Unicode::String. A SuperSearch on "character encoding conversion" or "utf8 iso-8859-1" or something like that should give you plenty of ways to do this.
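For comparison, here is a minimal sketch of that missing output step on a newer perl. It assumes perl 5.8 or later, where the Encode module ships with the core; it will not run on 5.6.1. The variable names are made up for illustration, and the decoded string merely stands in for the value XML::Simple would hand back.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: perl 5.8+ (Encode is a core module there).
# Stand-in for $x->{item}->{cost}: the parser has already turned the
# input into a Perl character string, whatever the file's encoding was.
my $cost = decode('UTF-8', "\xC2\xA3" . "300");   # the characters "£300"

# The missing step: convert to the encoding the output is expected to be in.
my $latin1 = encode('iso-8859-1', $cost);

# Show the bytes that would be written to the output file.
printf "%d characters in, %d bytes out: %s\n",
    length($cost), length($latin1),
    join ' ', map { sprintf '%02X', ord } split //, $latin1;
```

On 5.8 and later, another option is to put an encoding layer on the output handle, for example binmode(STDOUT, ':encoding(iso-8859-1)'), so that every print is converted on the way out.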

    And of course XML::Twig will let you work with the same encoding as the input ;--)

Re^3: Problem reading £ sign with XML::Simple
by dave_the_m (Monsignor) on Apr 27, 2005 at 10:20 UTC
    my system (ActivePerl 5.6.1...)
    Unicode handling is seriously hosed in perl versions before 5.8.0. You might want to upgrade.

    Dave.

Re^3: Problem reading £ sign with XML::Simple
by demerphq (Chancellor) on Apr 27, 2005 at 15:51 UTC

    Unicode is a character set whose code points run from 0 to 0x10FFFF, so it no longer fits in 16 bits. UTF-16 is an encoding where most characters in common use are represented with two bytes, just as normal 16-bit integers are represented (characters outside the Basic Multilingual Plane take a pair of two-byte units). The problem with this is that it makes a lot of legacy C code (especially present in *NIX systems) choke and die horribly under most circumstances, as the encoding normally involves lots of null bytes which the standard string functions can't handle. UTF-8 was designed to prevent these problems. Basically, it maps each code point to a sequence of 1 to 4 bytes (the original design allowed up to 6), none of which is ever null (unless the char itself is null), along with a couple of other interesting properties: the seven-bit ASCII set is valid UTF-8, and no proper substring of a valid shortest-form UTF-8 character representation is itself a valid character representation (this is useful at times).
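To make the byte layouts concrete, here is a small sketch that dumps a few characters in both encodings. It assumes perl 5.8 or later with the core Encode module; hex_bytes is a helper invented for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of a byte string, e.g. "C2 A3".
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, shift }

# A, the pound sign, the euro sign, and one character outside the BMP.
for my $cp (0x41, 0xA3, 0x20AC, 0x1F600) {
    my $char = chr $cp;
    printf "U+%04X  UTF-8: %-12s UTF-16BE: %s\n",
        $cp,
        hex_bytes(encode('UTF-8',    $char)),
        hex_bytes(encode('UTF-16BE', $char));
}
```

The UTF-16BE column for plain ASCII shows the null bytes that upset C string functions, while the UTF-8 column for the same characters is byte-for-byte ASCII.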

    Anyway, the point is that it's pretty unlikely that you are going to work with UTF-16 encoding very often, although you might find yourself doing so on the Win32 architecture, as internally Windows uses widechars for everything, iirc.

    NOTE: caveat emptor, this is as I remember things working from when I last dealt with Unicode in detail. I can't promise I've got the details exactly right.

    ---
    demerphq

Re^3: Problem reading £ sign with XML::Simple
by alexz (Beadle) on Apr 27, 2005 at 18:19 UTC
    Please note that changing the declaration at the top of the file does not magically change the encoding of the file. The encoding attribute serves only as a hint to the parser that the file is in a specific encoding.

    If you really want to recode your file, use any of the several methods already mentioned, or try the gnu recode utility (under cygwin, since you are on win32).

    The character in question is character 163 (A3 hex). Its byte representation in iso-8859-1 and windows-1252 is the single byte A3. However, the UTF-8 encoding of this character is the two bytes C2 A3. So the character string is being stored (possibly correctly) in UTF-8. However, when you try to print it, your print command thinks it is normal windows-1252 text, which is why you see the two characters you do.
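The byte arithmetic above can be reproduced with a short sketch. It assumes perl 5.8 or later with the core Encode module, and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $pound = "\x{A3}";                   # U+00A3 POUND SIGN
my $utf8  = encode('UTF-8', $pound);    # the two bytes C2 A3

# Misread those bytes as windows-1252, one character per byte.
my $seen = decode('cp1252', $utf8);

printf "UTF-8 bytes:     %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $utf8;
printf "misread as code points: %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $seen;
```

The misread string is two characters long: U+00C2 (a capital A with circumflex) followed by the pound sign itself.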
