All of the XML parser modules expect raw bytes of XML as input. Therefore your results may differ if you parse from a file or open filehandle rather than from a string - it all depends on how the data got into the string.

If you pass a string of XML to XML::Parser you need to be sure that it is a byte string and not a character string. So the anonymous monk's suggestion to 'use utf8;' is exactly the wrong thing in this case - it tells Perl that your script's source is UTF-8, so every non-ASCII literal string in the script ends up as a character string in Perl's internal representation. To convert from that to a form that an XML parser can read you'd need to use something like:

  use Encode;
  my $bytes = Encode::encode_utf8($string);
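
To make the difference between the two kinds of string concrete, here is a minimal sketch. The sample XML and variable names are made up for illustration, and the commented-out parser call assumes XML::Parser is installed:

```perl
use strict;
use warnings;
use Encode ();

# A character string containing U+00E9. The escape is used so the
# example does not depend on how this script file itself is encoded.
my $string = "<doc>caf\x{e9}</doc>";

# Encode the characters as UTF-8 bytes before handing them to a parser.
my $bytes = Encode::encode_utf8($string);

printf "characters: %d\n", length $string;   # 15
printf "bytes:      %d\n", length $bytes;    # 16

# The byte string is what you would then pass on, e.g.:
# XML::Parser->new->parse($bytes);
```

The one-character difference in length is the e-acute, which takes two bytes in UTF-8.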

Perl's internal character string representation is similar to but not exactly the same as UTF-8. In particular, characters in the range U+0080 to U+00FF may be stored as a single byte (the ISO-8859-1 form) instead of the 2 bytes you'd expect from UTF-8, so you can't assume that a character string already holds valid UTF-8 bytes.
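
A rough way to see this for yourself, as a sketch only: the variable names are illustrative, and the internal flag reported by utf8::is_utf8 is an implementation detail you shouldn't normally rely on in real code.

```perl
use strict;
use warnings;
use Encode ();

# One character, U+00E9, with no wider characters in the string.
my $char = "\x{e9}";

# Perl has not upgraded this string internally, so it is held as
# the single byte 0xE9 (the ISO-8859-1 form) and the flag is off.
my $flag = utf8::is_utf8($char) ? 1 : 0;

# Proper UTF-8 needs two bytes for the same character.
my $utf8 = Encode::encode_utf8($char);
my $hex  = join ' ', map { sprintf '%02X', ord } split //, $utf8;

printf "internal UTF8 flag: %d\n", $flag;   # 0
printf "UTF-8 bytes: %s\n", $hex;           # C3 A9
```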

In reply to Re: UTF-8 and XML::Parser by grantm
in thread UTF-8 and XML::Parser by Anonymous Monk
