
Funny characters in nodes

by dmitri (Curate)
on Jul 08, 2007 at 20:29 UTC ( #625530=monkdiscuss )

Esteemed Brethren,

as you may know, we (creamygoodness, dpavlin, and I) are working on a full-text perlmonks search. Corion provided us with the first 100,000 nodes in XML format. What I have noticed is that there are about 40 nodes whose XML cannot be parsed due to funny characters (see example).

This brings up several questions:

  • How do those funny characters get in there in the first place? Should we strip them out as users post things?
  • How did Corion create the XML nodes (which library did he use) and which library does perlmonks.org use to display nodes in XML?
  • How should the MonkSearch team proceed? Would it be wise to just strip out those characters before parsing XML out?
  • Shouldn't XML encode those characters somehow? Something like &0x1234; or however HTML entities are handled?

Thanks,
  - Dmitri.

Full list:

657 672 5132 5549 5767 7641 7888 10277 11869 15824 16378 19648 21510 21645 22045 26561 27261 29149 32115 32715 33870 35485 36797 36962 37826 42307 51955 52426 62887 75822 79160 85146 87117 87271 87991 89029 91802 94571 96186 97045

Re: Funny characters in nodes
by Corion (Pope) on Jul 08, 2007 at 20:52 UTC

    I produced these nodes basically by bypassing the complete webserver and directly running the code that creates the XML. That XML is produced by (a hacked version of) XML::Fling - I don't know how it handles well-formedness and how it relates to "real" XML.

    Update: Hmmm - I just noticed - XML::Fling does not exist on CPAN, so likely it was written by one of us... Anyway - I guess it only encodes the characters necessary to ensure proper XML-escaping, that is, anything matching /[<>&]/.
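If the guess above is right, a minimal sketch of such an escaper shows why the funny characters end up in the output. The helper name `escape_markup_only` is hypothetical and this is not XML::Fling's actual code; it only illustrates an escaper that touches nothing but `/[<>&]/`.

```perl
use strict;
use warnings;

# Hypothetical helper: escape only the characters matching /[<>&]/,
# as guessed above. This is NOT XML::Fling's real implementation.
my %entity = ( '<' => '&lt;', '>' => '&gt;', '&' => '&amp;' );

sub escape_markup_only {
    my ($text) = @_;
    $text =~ s/([<>&])/$entity{$1}/g;
    return $text;
}

my $escaped = escape_markup_only("a < b & c\x0C");

# The markup characters are escaped, but the form-feed (\x0C) passes through.
print $escaped =~ /\x0C/ ? "form-feed survived\n" : "form-feed gone\n";
```

An escaper like this yields output that is safe with respect to markup, yet a strict parser can still reject it because of the control character.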

      I could not find that library on CPAN:

      http://search.cpan.org/search?query=Fling&mode=all

      Update: oh, I see :)

Re: Funny characters in nodes
by holli (Monsignor) on Jul 08, 2007 at 21:35 UTC
    Strip them out. Nobody will search for a funny character, so it makes no sense to keep them.


    holli, /regexed monk/
Re: Funny characters in nodes
by jdporter (Canon) on Jul 08, 2007 at 22:11 UTC

    Leave them. And decode any encoded entities you find. People will not be searching for encoding strings, they'll be searching for raw "real" text. The XML representation of the nodes' content (and any errors arising therefrom) should be irrelevant.

    A word spoken in Mind will reach its own level, in the objective world, by its own weight
      I will obviously decode encoded entities -- my problem is that the standard XML libraries I have choke on those characters. I'd hate to write my own XML parser...

        Wrap the error in eval?

        Steve
        --
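The eval suggestion might look roughly like the sketch below. `parse_node_xml` is a hypothetical stand-in for whatever strict parser is in use (it simply dies on forbidden control characters), `parse_all` is an illustrative name, and the sample documents are made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a strict XML parser: it dies on any control
# character that XML 1.0 forbids, as a compliant parser would.
sub parse_node_xml {
    my ($xml) = @_;
    die "invalid character in XML\n" if $xml =~ /[\x00-\x08\x0B\x0C\x0E-\x1F]/;
    return { length => length $xml };    # placeholder "parse result"
}

# Parse every node, collecting the IDs that fail instead of aborting the run.
sub parse_all {
    my (%xml_by_id) = @_;
    my (%parsed, @failed);
    for my $id (sort { $a <=> $b } keys %xml_by_id) {
        my $doc = eval { parse_node_xml($xml_by_id{$id}) };
        if (defined $doc) { $parsed{$id} = $doc }
        else              { push @failed, $id }
    }
    return (\%parsed, \@failed);
}

my ($parsed, $failed) = parse_all(
    657 => "<node>page one\x0Cpage two</node>",    # made-up content
    658 => "<node>plain text</node>",              # made-up content
);
print "unparsable nodes: @$failed\n";
```

This keeps the indexer running and produces a list of nodes to inspect, but the failing nodes still go unindexed unless they are cleaned up and retried.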
Re: Funny characters in nodes (exactly zero)
by tye (Cardinal) on Jul 08, 2007 at 22:28 UTC

    The XML standard is stupidly broken because the designers made proclamations like Tim Bray's: "XML dislikes [...] form-feed[s] [etc.] which have exactly zero shared semantics from system to system". Yes, we all know that no two people in the world ever used form-feed for the same thing. I can't even guess what anybody else would use it for. But surely not to represent a page break, since that is my personal use for it and there is exactly zero shared semantics for that character so nobody else uses it for that.

    And the XML mindset of "we need the standard to require fatal errors for things that we dislike that others will surely see value in; otherwise, people will actually make XML useful by doing things we don't like" meant that XML 1.0 very thoroughly made sure that there was no reasonable way to get a form-feed character sent.

    So the only choices you have when you have data containing a form-feed character are

    1. Ignore that one part of the XML 1.0 standard and send the form-feed character anyway
    2. Strip any characters that Tim Bray doesn't personally like and hope that they weren't important
    3. Come up with some proprietary way of encoding data into characters that Tim Bray doesn't dislike and force anyone consuming your data to read the XML standard and your personal "this is how to decode my characters" specification

    Not surprisingly (to me, anyway), many XML parsers have actually chosen the first option above and the draft XML 1.1 standard even sees the light except in the case of nul characters (which we should be able to send as &#0; but I doubt even XML 1.1 will overcome previous stupidity to that extent).

    So, to the horror or severe disappointment of some people, PerlMonks XML generation also defaults to option 1 above. This needs to be changed but nobody has ponied up the code to make option 2 the default instead, so it must not be too big of a deal. Certainly, stripping control characters out of the XML from PerlMonks is quite simple and then allows any compliant XML parser to be used on it.

    So just do that (strip them). Or, if you want to preserve control characters despite Tim's dislike, come up with your own private encoding, encode them, parse the XML, then decode them. Or find a more tolerant near-XML parser.

    Update: Just for the sake of completeness, I should mention that encoding each form-feed character as &#12; is an interesting-sounding option but it is also forbidden by the XML 1.0 standard and so has no advantages over violating the standard more simply by just leaving them directly in the XML. Indeed, the main difference of such an act would be making it more complicated to strip out disliked characters to make the XML fully compliant.

    - tye        
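As a sketch of the stripping option: the XML 1.0 `Char` production allows only tab, line feed and carriage return below U+0020, so the remaining C0 controls can be deleted with a single `tr`. The function name `strip_xml10_controls` is illustrative, and the sketch covers only those C0 controls, not every code point XML 1.0 excludes.

```perl
use strict;
use warnings;

# Remove the control characters that XML 1.0 does not allow in a document.
# Tab (\x09), line feed (\x0A) and carriage return (\x0D) are legal and kept.
sub strip_xml10_controls {
    my ($xml) = @_;
    $xml =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d;
    return $xml;
}

my $clean = strip_xml10_controls("<doctext>page one\x0Cpage two\n</doctext>");
print $clean;
```

Run over the raw XML before it reaches the parser, this leaves the markup untouched and drops only the offending bytes.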

      Most of the characters that caused problems that I looked at can be safely ignored. They are not just linefeeds, however. What I'm afraid of is that they may be some multi-byte characters that make sense in another character set (especially since perlmonks.org uses Latin1 and not UTF-8).

        Then do option 2 or 3. Option 3 is pretty simple:

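        # Double each backslash and hex-encode each unwanted character
        s/(\\)|([...])/ $1 ? "\\\\" : sprintf "\\%02X", ord $2 /ge;
        my @elements= parseXML();
        # Undo the encoding in each parsed element
        s/\\(\\|..)/ length $1 == 1 ? "\\" : chr hex $1 /ge for @elements;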

        - tye        
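A self-contained round trip of this encode/parse/decode idea might look like the sketch below. The character class is an assumption (the C0 controls that XML 1.0 forbids), since the post leaves it unspecified, and `encode_controls` / `decode_controls` are illustrative names; the XML parsing step in the middle is omitted.

```perl
use strict;
use warnings;

# Assumed set of characters to hide from the parser (XML 1.0's forbidden
# C0 controls); the post above leaves the character class unspecified.
my $forbidden = qr/[\x00-\x08\x0B\x0C\x0E-\x1F]/;

# A backslash becomes "\\"; each forbidden character becomes "\" plus
# two uppercase hex digits.
sub encode_controls {
    my ($text) = @_;
    $text =~ s/(\\)|($forbidden)/ defined $1 ? "\\\\" : sprintf "\\%02X", ord $2 /ge;
    return $text;
}

# Reverse the encoding on text that has come back out of the parser.
sub decode_controls {
    my ($text) = @_;
    $text =~ s/\\(\\|[0-9A-F]{2})/ $1 eq "\\" ? "\\" : chr hex $1 /ge;
    return $text;
}

my $original = "C:\\temp page one\x0Cpage two";
my $encoded  = encode_controls($original);
print "$encoded\n";
print decode_controls($encoded) eq $original ? "round trip ok\n" : "mismatch\n";
```

Doubling the backslash first is what keeps the scheme unambiguous: a literal backslash followed by two hex digits in the source text cannot be mistaken for an encoded control character when decoding.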
