PerlMonks  

Re: UTF-8 and XML::Parser

by grantm (Parson)
on Oct 15, 2012 at 20:48 UTC (#999177)


in reply to UTF-8 and XML::Parser

All of the XML parser modules expect raw bytes of XML as input. Your results may therefore differ depending on whether you parse from a file or open filehandle, or from a string. With a string, it all depends on how the data got into it.

If you pass a string of XML to XML::Parser you need to be sure that it is a byte string and not a character string. So the anonymous monk's suggestion to 'use utf8;' is exactly the wrong thing in this case: it tells perl that your script's source is UTF-8, so every non-ASCII literal string in the script ends up as a character string in Perl's internal representation. To convert from that to a form that an XML parser can read you'd need to use something like:

use Encode ();
my $bytes = Encode::encode_utf8($string);
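To make the character/byte distinction concrete, here is a minimal sketch using only the core Encode module. The sample document and the variable names are made up for illustration, and the call to the parser is shown only as a comment since the original question's code isn't reproduced here.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# A character string: \x{E9} is the single character U+00E9 (e-acute).
# (Hypothetical sample document, not taken from the original question.)
my $string = "<doc>caf\x{E9}</doc>";

# Encode to UTF-8 bytes: U+00E9 becomes the two bytes 0xC3 0xA9.
my $bytes = encode_utf8($string);

# prints: characters: 15, bytes: 16
printf "characters: %d, bytes: %d\n", length($string), length($bytes);

# $bytes (not $string) is what you would hand to the parser, e.g.
#   XML::Parser->new(...)->parse($bytes);
```

The one-byte difference in length is the e-acute: one character in $string, two bytes in $bytes.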

Perl's internal character string representation is similar to, but not exactly the same as, UTF-8. In particular, characters in the range U+0080 to U+00FF can be stored as a single byte (the ISO-8859-1 form) instead of the 2 bytes you'd expect from UTF-8.
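If you want to see that for yourself, the following sketch peeks at the internal storage. It assumes nothing beyond core Perl; bytes::length is used purely as a debugging aid here (not something to rely on in real code), and the variable names are illustrative.

```perl
use strict;
use warnings;
use bytes ();                  # loaded only so we can call bytes::length
use Encode qw(encode_utf8);

my $native   = "\x{E9}";       # U+00E9, stored internally as a single byte
my $upgraded = "\x{E9}";
utf8::upgrade($upgraded);      # same character, now stored internally as two bytes

# Both are one character long and compare equal as strings...
my $same_string = ($native eq $upgraded) ? 1 : 0;

# ...but their internal storage differs.
my $native_internal   = bytes::length($native);
my $upgraded_internal = bytes::length($upgraded);

# encode_utf8 produces the same two UTF-8 bytes from either form.
my $same_bytes = (encode_utf8($native) eq encode_utf8($upgraded)) ? 1 : 0;

printf "equal: %d, internal bytes: %d vs %d, same encoded output: %d\n",
    $same_string, $native_internal, $upgraded_internal, $same_bytes;
```

This is why going through Encode::encode_utf8 is the safe route: it gives you proper UTF-8 bytes regardless of which internal form the string happens to be in.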

