PerlMonks  

XML Simple Charset Q?

by dingus (Friar)
on Nov 25, 2002 at 17:33 UTC ( #215678=perlquestion )
dingus has asked for the wisdom of the Perl Monks concerning the following question:

I have some very simple XML that I want to convert into a hash (sample below):
<rec id = 'F600' type = 'J'> <author>A. S. Bommarius, K. Drauz, W. Hummel, M.-R. Kula, C. Wandrey</author> (snippage) </rec>
Most of the time XML::Simple converts this just fine. Unfortunately, if one of the author names contains an umlaut (ä, ö or ü) then I get this error from XML::Parser (presumably called by XML::Simple):
"Error: not well-formed at line 2, column 18, byte 49 at c:/Perl/site/lib/XML/Parser.pm line 168"
where the character specified is the umlauted character.

I'm not finding an obvious way to tell XML::Simple that umlauts are OK (i.e. that I'm using ISO Latin-1). I would prefer not to have to do s/ö/&#246;/ lines before I use XML::Simple, although that clearly is a possibility.

FWIW
XML::Simple Version: 1.06;
XML::Parser Version: 2.27;

Dingus


Enter any 47-digit prime number to continue.

Re: XML Simple Charset Q?
by mirod (Canon) on Nov 25, 2002 at 17:44 UTC

    You have to tell the XML parser used by XML::Simple that your data is in ISO-8859-1 (that's latin1 for the rest of us), otherwise your data is NOT XML.

    Add this XML declaration at the top of your XML file:

    <?xml version="1.0" encoding="ISO-8859-1"?>

    But don't think that's enough... the parser (expat) will convert your data to utf8, so when you output it you might want to convert it back to latin1. Look at Unicode and locales for a recent thread on the subject.
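The declaration step can be sketched as follows, assuming the snippets arrive as Latin-1 byte strings without any declaration of their own. The helper name `with_latin1_declaration` and the sample author name are made up for illustration; they are not part of XML::Simple or of the original data.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Simple): prepend a Latin-1 XML
# declaration unless the snippet already starts with a declaration.
sub with_latin1_declaration {
    my ($xml) = @_;
    return $xml if $xml =~ /\A<\?xml\b/;
    return qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n} . $xml;
}

# Sample record; \xF6 is the Latin-1 byte for o-umlaut.
my $snippet = "<rec id='F600' type='J'><author>K. M\xF6ller</author></rec>";
my $doc     = with_latin1_declaration($snippet);

print $doc, "\n";
```

The resulting string can then be handed to XMLin, which accepts a string of XML as well as a file name. The text that comes back will have been converted by the parser, so it still needs re-encoding if the output page is to stay Latin-1.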

      The problem here is I'm trying to process many such snippets of XML for output as HTML. I suspect it's easier in this case to do a substitution regex with the /e modifier instead of going through all the munging back from UTF-8.

      s/([\x80-\xff])/'&#'.ord($1).';'/eg
      appears to work for all the characters I care about.
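A minimal sketch of that substitution applied to one record, assuming the input is a Latin-1 byte string. The wrapper name `escape_high_bytes` and the sample author name are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution from the post: replace each
# byte in the range 0x80-0xFF with a decimal numeric character reference.
sub escape_high_bytes {
    my ($bytes) = @_;
    (my $escaped = $bytes) =~ s/([\x80-\xff])/'&#' . ord($1) . ';'/eg;
    return $escaped;
}

# \xF6 is the Latin-1 byte for o-umlaut (decimal 246).
my $escaped = escape_high_bytes("<author>K. M\xF6ller</author>");
print "$escaped\n";   # prints <author>K. M&#246;ller</author>
```

This only yields the right references when the bytes really are ISO-8859-1, because it relies on the byte value being equal to the Unicode code point.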

      Update: XML::Parser still insists on converting &#NNN; to UTF-8! I didn't notice, as Mozilla cunningly noted the changed page encoding and displayed the page automagically as UTF-8. Mutter Mutter Curse Curse - this is a major pain as I'd like the page to remain Latin-1.

      Dingus


      Enter any 47-digit prime number to continue.
        Since the codes for Latin-1 are the same as Unicode for the first 256 values, that should work (you need to re-encode the values but don't need to translate them through a table). That is, if "use utf8" is not in scope when the regex is compiled. I don't know about Perl 5.8, which reportedly doesn't need the utf8 pragma; you might need some other way to refer to those characters on the input.

        Anyway, you can use the same lightweight trick to convert back: s/([\x{80}-\x{ff}])/pack('C',ord($1))/eg, compiled with utf8 in effect (note the curlies on the \x codes; these indicate UTF-8 encoded characters). Use pack instead of chr so you can specify bytes (chr does too much DWIMery, and the persuasion thing is not as transparent as one would hope when dealing with I/O, though I think its behavior in 5.6 would work in this case).

        —John

Re: XML Simple Charset Q?
by pg (Canon) on Nov 25, 2002 at 18:17 UTC
    Yes, you can use umlauts in your xml, and XML::Parser is okay with them. Just do two things:
    1. When you new your XML::Parser, specify  ProtocolEncoding => "Latin-1"
    2. If you don't have a file called Latin-1.enc under your XML/Parser/Encodings directory, get it from somewhere or make one for yourself. If you already have it, you are ready to go now.
      1. Where the heck do I find a latin-1.enc file? Google is not my friend right now :(

      2. Does this end up with UTF-8 output anyway? - see my update to my reply to mirod above.

      Dingus


      Enter any 47-digit prime number to continue.
      Any advice on where to find these protocol/encoding sections, or how they should look?

      I spend a lot of time tacking on the headers as suggested earlier in the thread, and I'd like to learn a little more about how expat and XML::Parser deal with encodings -- specifically, how they're mapped.

      Suggestions?

      If you don't have a file called Latin-1.enc under your XML/Parser/Encodings directory, get it from somewhere or make one for yourself. If you already have it, you are ready to go now.

      Actually there is no such file in the Encodings directory and there is no need for one. ISO-8859-1 is understood by expat natively:

      From XML::Parser doc:

      ProtocolEncoding
                     This is an Expat option. This sets the protocol encoding name.
                     It defaults to none. The built-in encodings are: "UTF-8",
                     "ISO-8859-1", "UTF-16", and "US-ASCII". Other encodings may be
                     used if they have encoding maps in one of the directories in
                     the @Encoding_Path list. Check the section on "ENCODINGS" for
                     more information on encoding maps. Setting the protocol
                     encoding overrides any encoding in the XML declaration.
      

      Please, please, please do not use the ProtocolEncoding option. As mirod said, if your source XML document a) does not declare an encoding and b) is not UTF8 (or UTF16) encoded, then it is not XML! The two preferred options are:

      • If you are generating the XML, then you need to include an XML declaration which specifies the encoding
      • If the XML is being generated by someone else, then you need to reject it since it is not well formed.

      Sure, you might guess that the encoding is ISO-8859-1 and it might seem to work if you force it with ProtocolEncoding, but the encoding might actually be CP1252 and the differences haven't tripped you up - yet.
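To illustrate how the two guesses can diverge, here is a small sketch using the core Encode module (available from Perl 5.8). The byte 0x80 is chosen purely as an example; it is not taken from the poster's data.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same byte decoded under the two candidate encodings.
my $as_latin1 = decode('ISO-8859-1', "\x80");
my $as_cp1252 = decode('cp1252',     "\x80");

printf "ISO-8859-1: U+%04X\n", ord $as_latin1;   # U+0080, a C1 control character
printf "CP1252:     U+%04X\n", ord $as_cp1252;   # U+20AC, the euro sign

# Bytes in the range 0xA0-0xFF, such as o-umlaut (0xF6), decode identically.
my $same = decode('ISO-8859-1', "\xF6") eq decode('cp1252', "\xF6");
print "0xF6 agrees: ", ($same ? "yes" : "no"), "\n";
```

The two encodings only disagree in the range 0x80-0x9F, which is why data containing nothing but accented letters can appear to work under either label.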

      The encodings section of the Perl XML FAQ may be useful.

Re: XML Simple Charset Q?
by mirod (Canon) on Nov 25, 2002 at 18:53 UTC

    OK, so first your version of XML::Parser is _OLD_. Keep it only if you are on Windows (PPM depends on it and it might not be wise to change it).

    Then search the site for ways to convert from utf8; there are plenty that work (Encode with 5.8.0, Text::Iconv if you have it, Unicode::* ...).

    Then (sorry grantm, I did not want to push XML::Twig but they are forcing me to ;--) you can always use XML::Twig with the keep_encoding option that will keep the data in its original encoding.
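For the Encode route, a minimal sketch of the conversion back to Latin-1 might look like the following. It assumes Perl 5.8 or later and that the parsed text is available as UTF-8 bytes; the variable names and sample author name are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes standing in for parser output; \xC3\xB6 encodes o-umlaut.
my $utf8_bytes = "K. M\xC3\xB6ller";

# Decode to a character string, then encode as Latin-1 bytes for output.
my $chars  = decode('UTF-8', $utf8_bytes);
my $latin1 = encode('ISO-8859-1', $chars);

printf "%d UTF-8 bytes -> %d Latin-1 bytes\n",
    length $utf8_bytes, length $latin1;
```

If the parser already returns character strings rather than raw UTF-8 bytes, the decode step is skipped and only the encode call is needed. Characters with no Latin-1 equivalent cannot survive this conversion, so it suits data known to have started out as Latin-1.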

OT: Re: XML Simple Charset Q?
by talexb (Canon) on Nov 26, 2002 at 15:36 UTC

    This is perhaps off-topic, but I was wondering why your XML is not as follows:

    <rec id = 'F600' type = 'J'> <author>A. S. Bommarius</author> <author>K. Drauz</author> <author>W. Hummel</author> <author>M.-R. Kula</author> <author>C. Wandrey</author> (snippage) </rec>
    --t. alex
    but my friends call me T.
      Because it's output from EndNote and I'd have to go and split the single author field that I get. Since, for the application I'm writing, we don't want to sort by author, just search on and display the author list, I can't be bothered to split the field up and then have to reintegrate it for the display.

      (It's a good question though - and I have thought about it. If I get my other XML entity question sorted I may revisit this, as there could be some advantage if I did this and used XML::Twig.)

      Dingus


      Enter any 47-digit prime number to continue.
