PerlMonks  


I've dealt with various aspects of this problem at different times, so let me take a stab here...

The first option that comes to mind is this: if XML::Simple can handle your character set, and your character set is an acceptable one for web browsers (such as ISO-Latin-1), why not just use the raw characters? Most browsers that can display the characters correctly at all can handle that character set, as far as I know.

However, I'll try to answer the opposite question as well (can't hurt, and might just be helpful).

The problem you're having is that XML::Simple does not recognise the entities you're passing it in your XML source. This is entirely appropriate--XML itself predefines only five entities, &amp; &lt; &gt; &apos; and &quot; (for & < > ' and "), and as far as I know those are the only ones XML::Simple understands out of the box. Therefore, when it encounters something like &auml;, which is unquestionably an entity but not one that is declared anywhere it can see, it does what every good XML parser does when it finds something unexpected: die.
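To make that concrete, here is a minimal sketch of the failure. The two XML strings are made up for illustration, and the exact error text depends on which parser backend XML::Simple ends up using (I'm assuming the usual XML::Parser one), so treat it as a demonstration rather than a guarantee:

```perl
use strict;
use warnings;
use XML::Simple;

# Hypothetical inputs: one uses a predefined XML entity (&amp;),
# the other an HTML-only entity (&auml;) that XML knows nothing about.
my $ok_xml  = '<doc><name>Tom &amp; Jerry</name></doc>';
my $bad_xml = '<doc><name>J&auml;rvi</name></doc>';

# The predefined entity is decoded during parsing.
my $ok = XMLin($ok_xml);
print "parsed: $ok->{name}\n";

# The undeclared entity makes the parser die, so trap it with eval.
my $bad = eval { XMLin($bad_xml) };
print "parse failed: $@" unless defined $bad;
```

The eval is only there so the script survives long enough to show the message.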

The obvious solution to this is to tell the parser to recognize your entities, but there are two objections:

  1. that could easily get rather un-simple
  2. that would definitely defeat your original purpose

Why this last? Well, by the time XML::Simple spits out your parsed data, it has already translated the entities in its input to the corresponding character data (much as the web browser will with the HTML entities). Which leaves us right where we started, really--if you can handle outputting those characters to the browser, then just put them in your XML source to begin with.

However, this suggests the solution that I personally have used for this problem the few times I've encountered it: double-escape the data going into your XML source. That is, if you want to parse your XML and have it contain the string "&eacute;", arrange for your XML source file to contain the string "&amp;eacute;". The alternative is to enclose the relevant sections in CDATA tags, which is acceptable for some things (including wholesale HTML markup in XML files) but generally overkill, in my opinion.
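A quick sketch of both approaches, assuming a made-up one-element document (the element name and contents are just for illustration). Either way, the parsed value should come out as the literal text "Caf&eacute;", ready to hand to a browser:

```perl
use strict;
use warnings;
use XML::Simple;

# Double-escaped: the parser turns &amp; back into &, leaving "&eacute;" as text.
my $escaped = XMLin('<doc><name>Caf&amp;eacute;</name></doc>');

# CDATA section: the parser leaves the contents alone entirely.
my $cdata = XMLin('<doc><name><![CDATA[Caf&eacute;]]></name></doc>');

print "$escaped->{name}\n";
print "$cdata->{name}\n";
```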

To actually do this programmatically (assuming you're dealing with input that includes the literal characters you're trying to escape), you're probably best off with HTML::Entities, as mentioned above: it's distributed with HTML::Parser but does not partake of the weightiness of that module (or its need for compilation). If you have it installed, then something along these general lines should do the trick:

use HTML::Entities;
use XML::Simple;

while (<TEXT_FILE>) {
    encode_entities($_);    # void context: encodes $_ in place
    encode_entities($_);    # yes, really twice
    do_stuff($_);
}
print XMLout($foo);    # the data structure built by do_stuff()

Possibly the lamest code example I've ever posted, that... I do suggest that comment, though, for the benefit of your associates and successors. If that doesn't encode all the characters you need encoded, check out the other parameters to that function--it can do what you need done.
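For what it's worth, the parameter I have in mind is the optional second argument to encode_entities, which names the characters to treat as unsafe. A small sketch, with a made-up sample string:

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Hypothetical sample: an e-acute (written as \xE9) plus some markup characters.
my $text = "caf\xE9 <b> & co";

# Default: encodes <, >, &, quotes, control characters and high-bit characters.
my $all = encode_entities($text);

# Second argument: a character-class-style list of what to encode.
# Here only the markup characters are touched; the e-acute is left raw.
my $markup_only = encode_entities($text, '<>&');

print "$all\n";
print "$markup_only\n";
```

Called with a return value like this, encode_entities works on a copy, so $text itself is left unchanged.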

Good luck!

Update: added print line to snippet, in a possibly doomed attempt to make it resemble actual code.

Update: doh! Working too hard and thinking too little--XMLout does, of course, escape XML entities, so only one round of HTML escaping is called for (if you're using XMLout). Thanks to ajt for the catch!
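In other words, if the data is going out through XMLout, something like the following sketch should be enough. The hash and its key are made up for illustration; the round trip through XMLin is only there to check that the entity survives as literal text:

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);
use XML::Simple;

# One round of HTML escaping: e-acute (\xE9) becomes "&eacute;".
my $name = encode_entities("Caf\xE9");

# XMLout escapes the & itself, so the XML contains "&amp;eacute;".
my $xml = XMLout({ name => $name });
print $xml;

# Parsing it back yields the literal text "Caf&eacute;".
my $back = XMLin($xml);
print "$back->{name}\n";
```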



If God had meant us to fly, he would *never* have given us the railroads.
    --Michael Flanders


In reply to Re: , , and XML::Simple by ChemBoy
in thread , , and XML::Simple by BioHazard
