XML::Parser and &entity;

by dingus (Friar)
on Nov 26, 2002 at 15:48 UTC ( #215850=perlquestion )
dingus has asked for the wisdom of the Perl Monks concerning the following question:

This is a follow up question to the one yesterday (XML Simple Charset Q?) about parsing data.

As well as some Latin-1 accented characters, I also have some (valid) HTML entities such as &plusmn; ( ± ) and &le; ( ≤ ). Unfortunately, no matter what I do, my XML::Parser always barfs on these entities. I've changed the charset I use in the file or pass as a raw parameter (or even, when I'm playing with XML::Twig, using keep_encoding) and changed the top-level XML package (either XML::Simple or XML::Twig), to no avail.

My HTML::Entities correctly recognises and converts the entities, so that's presumably not the issue.

Is this an XML::Parser bug (and if so, is it due to an old version of XML::Parser)? Or am I just completely misunderstanding something? I would greatly prefer not to have to manually convert these entities before handing off to the parser, and ideally I'd like them untouched since they don't need to be changed by any of the parsers.

Versions
HTML::Entities Version: 1.23;
XML::Parser Version: 2.27;

Sample always failing code

use Data::Dumper;
use XML::Simple;

my $xml= <<EOXML ;
<rec id = 'F600' type = 'J'>
<author>A. S. B&#245;mmarius, K. Drauz, W. Hummel, M.-R. Kula, C. Wandrey</author>
<text>&lt; &amp; &ge; &#242; &plusmn;</text>
</rec>
EOXML

$xmlref = XMLin($xml);

Error: undefined entity at line 3, column 17, byte 129 at c:/Perl/site/lib/XML/Parser.pm line 168
Line 3 col 17 appears to be the &ge;

Dingus
PS This entity and accent crud is almost enough to make me use regexes for XML parsing :)


Enter any 47-digit prime number to continue.

Re: XML::Parser and &entity;
by mirod (Canon) on Nov 26, 2002 at 16:45 UTC

    Your data is still not well-formed XML. The only pre-defined entities in XML are &lt;, &amp;, &gt;, &quot; and &apos;. Numerical entities, &#nb; or &#xnb; are also allowed. Everything else needs to be declared.
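
A quick way to see which references in a chunk of data will trip the parser is to list every named entity that is not one of the five predefined ones. The following is only a rough sketch: undeclared_entities is a hypothetical helper name, and the regex ignores CDATA sections, comments, and anything declared in a DOCTYPE.

```perl
use strict;
use warnings;

# The five entities every XML parser knows without a DTD.
my %predefined = map { $_ => 1 } qw(lt gt amp quot apos);

# Return the names of entity references that are neither predefined
# nor numeric, in order of first appearance (hypothetical helper).
sub undeclared_entities {
    my ($xml) = @_;
    my (%seen, @names);
    # A leading letter or underscore is required, so &#242; and
    # &#xF2; never match here.
    while ($xml =~ /&([A-Za-z_][\w.-]*);/g) {
        my $name = $1;
        next if $predefined{$name} or $seen{$name}++;
        push @names, $name;
    }
    return @names;
}

my $text = '<text>&lt; &amp; &ge; &#242; &plusmn;</text>';
print join(', ', undeclared_entities($text)), "\n";   # prints "ge, plusmn"
```

Run against the sample record from the question, this reports ge and plusmn, which matches the position the parser complained about.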

    So you have several options here:

    • you can just turn all those entities into their numerical equivalent (for example &le; is &#8804;); this will be a pain, error-prone, and will not take advantage of XML power (which is a mortal sin in certain circles!),
    • you can include all of the entity definitions you need in the document:
      <!DOCTYPE rec [
        <!ENTITY le "&#8804;">
        <!-- many more entity declarations-->
      ]>
      <rec> &le;</rec>
      this has the advantage that your documents are standalone (they don't depend on external files) but an empty document might be quite big,
    • you can also go all the (XML) way and include the entity declarations through... entities, the way it's done in the HTML DTD:

      <!DOCTYPE rec [
        <!ENTITY % HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" "xhtml-lat1.ent">
        %HTMLlat1;
        <!ENTITY % HTMLsymbol PUBLIC "-//W3C//ENTITIES Symbols for XHTML//EN" "xhtml-symbol.ent">
        %HTMLsymbol;
        <!ENTITY % HTMLspecial PUBLIC "-//W3C//ENTITIES Special for XHTML//EN" "xhtml-special.ent">
        %HTMLspecial;
      ]>
      <rec> &le;</rec>
      you can get the entity declaration files from the W3C.
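
If you go for the second option (declarations inside the document itself), the DOCTYPE can be generated rather than typed by hand. Here is a minimal sketch, assuming a small hand-made table of entity names and code points; doctype_for and the three-entry table are made up for the illustration, and a real table would need every entity the data uses.

```perl
use strict;
use warnings;

# Entity name => Unicode code point. Only a few are listed here; a real
# table would cover every entity the data actually uses.
my %entity = (le => 8804, ge => 8805, plusmn => 177);

# Build a DOCTYPE with an internal subset declaring each entity as a
# numeric character reference (hypothetical helper).
sub doctype_for {
    my ($root, $entities) = @_;
    my $decls = join '',
        map { sprintf qq{  <!ENTITY %s "&#%d;">\n}, $_, $entities->{$_} }
        sort keys %$entities;
    return sprintf "<!DOCTYPE %s [\n%s]>\n", $root, $decls;
}

my $xml = "<rec><text>&lt; &amp; &ge; &plusmn;</text></rec>";
my $standalone = doctype_for('rec', \%entity) . $xml;
print $standalone, "\n";
```

The resulting string should then be acceptable to XMLin or any other XML::Parser based module. Note that if the data starts with an <?xml ...?> declaration, the DOCTYPE has to go after it, not before.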

    You could also internally use the entity declaration files and pre-process the XML to convert them to numerical entities. I can't think of an easy way to do this right now (except using xmlwf -p -d result_dir file.xml but then the output is in utf-8) but I'll have a look at it.
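
For what it's worth, the pre-processing idea can be sketched in plain Perl, assuming the declaration files use the simple <!ENTITY name "&#NNN;"> form. The helper names read_entity_table and to_numeric are invented for this example, and the three-line $ent_text is a stand-in for a real .ent file. It is a blind text substitution, so it will also touch entity-looking text inside CDATA sections and comments.

```perl
use strict;
use warnings;

# Collect name => code point from declarations of the form
#   <!ENTITY name "&#NNN;">
# Declarations with any other kind of value are skipped.
sub read_entity_table {
    my ($ent_text) = @_;
    my %table;
    while ($ent_text =~ /<!ENTITY\s+(\w+)\s+"&#(\d+);"\s*>/g) {
        $table{$1} = $2;
    }
    return \%table;
}

# Replace known named entities with numeric references; leave the five
# predefined entities and anything not in the table untouched.
sub to_numeric {
    my ($xml, $table) = @_;
    my %keep = map { $_ => 1 } qw(lt gt amp quot apos);
    $xml =~ s{&(\w+);}{
        (!$keep{$1} && exists $table->{$1}) ? "&#$table->{$1};" : "&$1;"
    }ge;
    return $xml;
}

# Stand-in for the contents of an .ent file (assumed sample, not the full file).
my $ent_text = <<'ENT';
<!ENTITY plusmn "&#177;"> <!-- plus-minus sign -->
<!ENTITY le     "&#8804;"> <!-- less-than or equal to -->
<!ENTITY ge     "&#8805;"> <!-- greater-than or equal to -->
ENT

my $table = read_entity_table($ent_text);
print to_numeric('<text>&lt; &amp; &ge; &#242; &plusmn;</text>', $table), "\n";
# prints: <text>&lt; &amp; &#8805; &#242; &#177;</text>
```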

    Update: fixed typo in doctype

      The easiest way I have found to resolve the entities, replacing them by the numerical entity, and to drop the now useless DTD is to use xmllint, from libxml2:

      xmllint --noent --dropdtd <file.xml>

      Just to get this 100% clear in my not very XMLized head.

      A valid XML file with no includes/inline entity definitions may contain:

      • Valid unicode utf-8 characters
      • &lt;, &amp;, &gt;, &quot;, &apos;.
      • Numerical entities, &#nb; or &#xnb;
      and nothing else?

      The good news is that I do in fact control the source data file, so I can do further munging. It looks like the best option is to convert the file to utf-8, including the entities that are not defined. Then, since the characters are in fact all valid latin-1, I can use my favourite pack/unpack trick to convert UTF-8 back to latin-1 for the display:

      sub utf8toNative {
          my $u = $_[0];
          # Read the input as UTF-8 and repack each code point as one byte
          my $c = pack("C*", unpack("U*", $u));
          return (length($c) == length($u)) ? $u : $c;
      }
      (You have to return the string unchanged if the lengths are the same, as the new string may be incorrect in such cases)
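
As an alternative to the pack/unpack trick (a different technique, not the one above), the core Encode module can do the same conversion on Perl 5.8 and later. This is a sketch under that assumption; utf8_to_latin1 is a hypothetical name. The Encode::FB_XMLCREF fallback means a character with no Latin-1 equivalent comes out as a numeric entity rather than garbage.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert UTF-8 bytes to Latin-1 bytes. Characters with no Latin-1
# equivalent (such as U+2264, "less-than or equal to") are written as
# &#xHHHH; references instead of being mangled.
sub utf8_to_latin1 {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);
    return encode('ISO-8859-1', $chars, Encode::FB_XMLCREF);
}

# UTF-8 bytes for "B", o-tilde, "mmarius", space, U+2264, space, "1"
my $utf8   = "B\xC3\xB5mmarius \xE2\x89\xA4 1";
my $latin1 = utf8_to_latin1($utf8);
print "$latin1\n";
```

Whether entities in the display output are acceptable depends on where the text ends up, so treat the fallback choice as one option among several.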

      Dingus


      Enter any 47-digit prime number to continue.

        utf-8-ing everything will indeed save you some headaches. That will be playing along with the "XML Way", instead of fighting it. Just to be complete though: you can use another encoding if you specify it in the xml declaration (<?xml version="1.0" encoding="ISO-8859-1"?>). XML::Parser based modules will nevertheless convert the input to utf-8 before passing it to your code.

Re: XML::Parser and &entity;
by Anonymous Monk on Nov 26, 2002 at 16:54 UTC
    the same style of paths
