PerlMonks  

Converting HTML special entities to XML

by drhender (Initiate)
on Sep 01, 2004 at 18:33 UTC (#387639)
drhender has asked for the wisdom of the Perl Monks concerning the following question:

I am using Perl to parse HTML and convert it to XML. Slowly, over time, I have built up a table of HTML special entities (&copy;, &nbsp;, &pound;, etc.) that I have to convert to hex values before putting them in the XML. Does anyone know if there's a module lying around somewhere that would do that conversion for me, or should I just stick with the lookup table?

Re: Converting HTML special entities to XML
by Aristotle (Chancellor) on Sep 01, 2004 at 18:37 UTC
      I think it is better to translate them to character references. The entities can't be represented accurately other than with Unicode. The HTML entity resolver would need to produce UTF-8 strings.

      This assumes that the HTML-to-XML process is converting escaped text to escaped text. If the text is being unescaped for other reasons, then the entities should be expanded to UTF-8 and escaped on output.
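
For the escaped-text-to-escaped-text case, a minimal sketch of the translation to character references might look like the following. The three-entry table is only an illustrative subset, and `named_to_numeric` is a hypothetical helper name, not something from an existing module. Names that are not in the table (including XML's own predefined `&amp;`, `&lt;` and friends) are passed through unchanged.

```perl
use strict;
use warnings;

# Illustrative subset only (assumption): a real table would cover every
# entity declared in the HTML/XHTML DTDs.
my %codepoint = (
    nbsp  => 160,
    pound => 163,
    copy  => 169,
);

# Hypothetical helper: rewrite known named entities as hexadecimal
# character references and leave everything else as it was.
sub named_to_numeric {
    my ($text) = @_;
    $text =~ s{&([A-Za-z][A-Za-z0-9]*);}{
        exists $codepoint{$1} ? sprintf('&#x%X;', $codepoint{$1}) : "&$1;"
    }ge;
    return $text;
}

print named_to_numeric('&copy; 2004, &pound;5&nbsp;&amp; up'), "\n";
# prints: &#xA9; 2004, &#xA3;5&#xA0;&amp; up
```

Leaving unknown names alone is a deliberate choice here: it keeps `&amp;` and `&lt;` valid in the XML, at the cost of silently passing through any HTML entity the table has not learned yet.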

        They should always be expanded to UTF-8 and escaped on output. Your HTML parser should just give you Unicode, and whatever XML generator you use should be escaping it automatically for you as appropriate for the target encoding.

        Don't attempt to transcode entities and whatnot manually in order to insert literal bytes into the output XML stream. That way lies madness (and a lot of buggy code; most code dealing with XML out there is quite broken with regard to encodings).

        Makeshifts last the longest.
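
To illustrate the "escape on output" half of that advice, here is a rough sketch of what an XML writer does on your behalf when the target encoding is US-ASCII. The name `xml_escape_ascii` is hypothetical, and the input is assumed to be a string your HTML parser has already decoded to Unicode characters. In practice you would let your XML generator handle this rather than rolling your own.

```perl
use strict;
use warnings;

# Hypothetical helper (assumption): escape an already-decoded Unicode
# string for element content in an XML document written as US-ASCII.
sub xml_escape_ascii {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first, before other references are added
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    # Anything outside ASCII becomes a hexadecimal character reference.
    $text =~ s/([^\x00-\x7F])/sprintf '&#x%X;', ord $1/ge;
    return $text;
}

# U+00A9 is the copyright sign, U+00A3 the pound sign.
my $decoded = "\x{A9} 2004 Fish & Chips, \x{A3}5";
print xml_escape_ascii($decoded), "\n";
# prints: &#xA9; 2004 Fish &amp; Chips, &#xA3;5
```

With a UTF-8 target the last substitution would not be needed at all, which is the point of keeping the data as Unicode until the moment it is written.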

Re: Converting HTML special entities to XML
by iburrell (Chaplain) on Sep 01, 2004 at 22:32 UTC
    Look at the entity declarations in the XHTML or HTML specs. Those are what real SGML/XML processors use to translate the entities into character references.

    http://www.w3.org/TR/xhtml1/#h-A2 has links to the DTD files for XHTML1.
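
If you do keep a lookup table, it could be generated from those entity declarations instead of being typed in by hand. The sketch below assumes declarations of the simple form `<!ENTITY name "&#NNN;">`, and the three sample lines are written in that style for illustration rather than read from the real files. Declarations in any other form are simply skipped by this pattern.

```perl
use strict;
use warnings;

# Sample declarations in the style of the XHTML entity sets (assumption:
# in real use you would read the .ent files linked from the spec).
my $declarations = <<'END';
<!ENTITY nbsp   "&#160;">
<!ENTITY pound  "&#163;">
<!ENTITY copy   "&#169;">
END

# Build a name => code point table from each simple declaration.
my %codepoint;
while ($declarations =~ /<!ENTITY\s+(\w+)\s+"&#(\d+);"\s*>/g) {
    $codepoint{$1} = $2;
}

# Show the table as hexadecimal character references.
printf "%s => &#x%X;\n", $_, $codepoint{$_} for sort keys %codepoint;
```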
