Re: XML and entities, what am I doing wrong?

The real problem is 'The Unicode'

Most Perl XML modules are built on top of expat, or on XML::Parser, which is an interface to expat. Expat is an XML parser: it takes your XML (XHTML) document and processes its tags and so on. But since XML is fundamentally based on Unicode, expat converts all your characters to Unicode and hands them back as UTF-8. For this conversion to work properly, you need a valid encoding declared in the XML header: <?xml version='1.0' encoding='iso-8859-2'?> This is the primary reason for the odd characters you are seeing. They are the UTF-8 (8-bit Unicode) representation of your non-English characters.
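
To see the conversion in isolation, here is a minimal sketch. It uses Python's standard xml.parsers.expat binding rather than XML::Parser, only because it wraps the same expat library and needs nothing installed. The sample document and the helper name parse_text are made up for illustration.

```python
import xml.parsers.expat

# Hypothetical sample document in ISO-8859-2.
# Byte 0xB9 is the single character "s with caron" (U+0161) in that encoding.
DOC = b"<?xml version='1.0' encoding='iso-8859-2'?><p>\xb9</p>"

def parse_text(raw):
    """Return the character data expat reports for a raw XML byte string."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # expat may deliver text in several pieces, so collect and join them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(raw, True)
    return "".join(chunks)

text = parse_text(DOC)        # Unicode string, no longer ISO-8859-2 bytes
utf8 = text.encode("utf-8")   # the same character takes two bytes in UTF-8

# Reading those UTF-8 bytes as if they were still ISO-8859-2 gives
# two unrelated characters: the "odd characters" effect.
misread = utf8.decode("iso-8859-2")

print(len(DOC.split(b"<p>")[1].split(b"</p>")[0]), "byte in,", len(utf8), "bytes out")
print(repr(misread))
```

The point is only that one input byte comes back as two, and that a program still assuming the original 8-bit charset will display them as garbage.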

You probably want to avoid this conversion. I had a similar problem about a year ago, but found no useful solution. XML::Parser has an original_string method which returns character data in the original encoding, but it won't expand entities. And there is no way to get attributes in the original encoding. The best workaround is to use Unicode::Map8 to map all the Unicode strings back to their original encoding, but that is terribly slow for frequent use.
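
The back-conversion idea looks roughly like the sketch below. It is again written against Python's expat binding, not against Unicode::Map8 (whose API is not shown here). The names to_original and parse_back are hypothetical, and the original encoding is assumed to be known in advance.

```python
import xml.parsers.expat

# Hypothetical sample: one non-ASCII byte in an attribute and one in the text.
DOC = b"<?xml version='1.0' encoding='iso-8859-2'?><p title='\xb9'>\xb9</p>"
ORIGINAL_ENCODING = "iso-8859-2"  # assumption: known up front

def to_original(s):
    """Map a Unicode string from the parser back to the original 8-bit charset."""
    return s.encode(ORIGINAL_ENCODING)

def parse_back(raw):
    """Parse raw XML and return (attributes, text), both as original-charset bytes."""
    attrs_seen = {}
    text_seen = []

    def on_start(name, attrs):
        # Attribute values arrive as Unicode too, so convert each one.
        for key, value in attrs.items():
            attrs_seen[key] = to_original(value)

    def on_text(data):
        text_seen.append(to_original(data))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.CharacterDataHandler = on_text
    parser.Parse(raw, True)
    return attrs_seen, b"".join(text_seen)

attrs, text = parse_back(DOC)
print(attrs, text)
```

Every string passes through an extra conversion on its way out, which is where the cost comes from. The conversion also fails (here with UnicodeEncodeError) for any character the original charset cannot represent, for example one introduced by a numeric character reference.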

So I wrote my own poor man's XML parser based on Perl patterns. That is not a solution, though, just a hack. If you plan to use XML, you had better move to Unicode completely.

PS: I wonder how XML::Twig implements its keep_encoding option. By forcing expat to behave reasonably or by back conversion to original charset?

Re: Re: XML and entities, what am I doing wrong?
by mirod (Canon) on Jun 08, 2001 at 17:59 UTC

XML::Twig uses the original_string method to keep the characters in the original encoding (but then it works only for 1-byte encodings as it uses a regexp to parse the start tag string to extract the tag name and the attributes). In order to track the entities (and not expand them) I use a Default handler that spots them and stores them as a special element.
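
For illustration only, splitting a raw start tag with a regexp can be sketched as follows. This is not XML::Twig's actual regexp; the patterns and the name split_start_tag are hypothetical, it is written in Python rather than Perl so it runs without extra modules, and it assumes one byte per character with ASCII-compatible markup.

```python
import re

# Tag name: everything after "<" up to whitespace, "/" or ">".
TAG_NAME = re.compile(rb"<([^\s/>]+)")
# Attribute: name, "=", then a double-quoted or single-quoted value.
ATTRIBUTE = re.compile(rb"""([^\s=/>]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")

def split_start_tag(raw_tag):
    """Return (tag_name, attributes) from the raw bytes of a start tag."""
    m = TAG_NAME.match(raw_tag)
    if m is None:
        raise ValueError("not a start tag")
    attrs = {}
    # Scan for attributes only after the tag name.
    for a in ATTRIBUTE.finditer(raw_tag, m.end()):
        value = a.group(2) if a.group(2) is not None else a.group(3)
        attrs[a.group(1)] = value
    return m.group(1), attrs

# The attribute value stays in the original encoding, byte for byte.
name, attrs = split_start_tag(b"<p title='\xb9' id=\"x\">")
print(name, attrs)
```

Because the patterns look at individual bytes, they are only safe when a byte such as a quote or ">" can never be part of a longer character, which is the 1-byte restriction mentioned above.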

The latest (still beta) version also comes with a bunch of filters to convert the UTF-8 back to latin1, to html-style text (using HTML::Entities), to DOM-style ASCII + character entities, or to any other encoding using either Unicode::Map8 or (even better, if the iconv library is installed on your system) Text::Iconv.
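
What such output filters do to the bytes can be sketched like this. It is not XML::Twig's filter interface; the three function names are hypothetical, and the sketch is in Python so that it needs nothing beyond the standard library.

```python
def to_latin1(utf8_bytes):
    """UTF-8 back to latin1 (ISO-8859-1); fails on characters latin1 lacks."""
    return utf8_bytes.decode("utf-8").encode("iso-8859-1")

def to_ascii_with_entities(utf8_bytes):
    """Pure ASCII output, with every non-ASCII character as a numeric character reference."""
    return utf8_bytes.decode("utf-8").encode("ascii", "xmlcharrefreplace")

def to_encoding(utf8_bytes, encoding):
    """UTF-8 to any other encoding the runtime knows, e.g. 'iso-8859-2'."""
    return utf8_bytes.decode("utf-8").encode(encoding)

sample = b"\xc5\xa1"  # UTF-8 for U+0161
print(to_ascii_with_entities(sample))
print(to_encoding(sample, "iso-8859-2"))
print(to_latin1(b"caf\xc3\xa9"))
```

The character-reference form is the only one of the three that cannot fail, since any Unicode character can be written as a numeric reference.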

Overall, using the original_string method, even though it is frowned upon as not being completely kosher, is the easiest choice if (IF) you are using a 1-byte encoding. Dealing with the various cases of internal and external entities (depending on whether they are defined at the beginning of the document or in a separate file) is way trickier, and entities within attributes are generally a huge pain to deal with using XML::Parser.
