PerlMonks  

Re: XML::Fling begone? (ctrl, utf-8)

by tye (Sage)
on Dec 19, 2004 at 19:24 UTC ( [id://416040]=note )


in reply to XML::Fling begone?

When you benchmark, be sure to time the building of a string to output as Genx won't have a handle to write to.

One problem with XML 1.0 is that they made some stupid decisions with regard to control characters. This is likely fixed in the next version of the XML spec (which I assume is still not finished).

In my experience, the majority of XML parsers are actually non-compliant on this point (perhaps a form of civil disobedience or a subconscious revolt against a design misfeature?), so producing non-compliant XML has a practical advantage for me. If Genx is compliant on this point, then switching to it will probably be too much thrash to be worth the minor benefit.

When XML 1.1 becomes available, the stupid design decision will be restricted to NUL characters, which is an acceptable compromise. That means using Genx and letting the user select which version of XML they want output would be great.

Only being able to produce UTF-8 may have some interesting consequences. We have a hard time getting people to deal with encodings in XML correctly. The change will likely cause some disruption. It may ease some problems. For example, cbhistory still produces UTF-8 output but claims it is Latin-1 (because it feeds Latin-1 to its XML parser, but the parser insists on producing UTF-8 output and the author didn't appreciate this fact). So such a change might fix this problem and/or may cause it to appear in more places. I just mention this in hopes that this somewhat minor point will be properly addressed if a change is made.
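To make the mislabelling concrete, here is a minimal sketch using only the core Encode module. The sample string and variable names are made up for illustration; this is not cbhistory's code, just the general shape of the bug: UTF-8 bytes shipped under a Latin-1 declaration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A Latin-1 byte string as an application might hold it:
# "cafe" with an e-acute stored as the single byte 0xE9.
my $latin1_bytes = "caf\xE9";

# Decode to Perl characters, then encode as UTF-8,
# which is what a UTF-8-only writer would emit.
my $chars      = decode('ISO-8859-1', $latin1_bytes);
my $utf8_bytes = encode('UTF-8', $chars);

# The mislabelled document: UTF-8 bytes under a Latin-1 declaration.
my $mislabelled = qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n<m>$utf8_bytes</m>\n};

# A consumer that trusts the declaration decodes the payload as Latin-1
# and sees two characters where one was intended.
my $misread = decode('ISO-8859-1', $utf8_bytes);

printf "intended chars: %d, chars seen under the wrong label: %d\n",
    length $chars, length $misread;
```

If the writer only ever produces UTF-8, the declaration (or its absence, since UTF-8 is a default) has to follow the writer rather than the input data.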

- tye        

Re^2: XML::Fling begone? (ctrl, utf-8)
by Aristotle (Chancellor) on Dec 19, 2004 at 21:30 UTC

    Please elaborate on control characters. I have a vague recollection of hearing something like that before, but I can't recall the specifics. And, handwaving the issue before I actually know what it is: is this something CDATA sections or entitification cannot fix in a generally compatible fashion?

    Makeshifts last the longest.

      No, the XML 1.0 spec declares that non-whitespace control characters (with or without the eighth bit set) are illegal in XML and entities for illegal characters are illegal.
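For anyone who wants to check their own data, here is a rough sketch of a test against the XML 1.0 `Char` production as I read it (tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF). The helper names are hypothetical, and the exact character set is worth verifying against the spec. Note that, on my reading, this production excludes only the C0 range (plus surrogates and U+FFFE/U+FFFF), so the 0x80 to 0x9F controls pass this check.

```perl
use strict;
use warnings;

# Matches any single character outside the XML 1.0 Char production
# (as transcribed from the spec; verify before relying on it).
my $XML10_ILLEGAL =
    qr/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/;

# Hypothetical helper: true if $text (characters, not bytes)
# contains nothing that XML 1.0 forbids.
sub is_xml10_text {
    my ($text) = @_;
    return $text !~ $XML10_ILLEGAL;
}

# Hypothetical helper: list the offending code points for a diagnostic.
sub xml10_offenders {
    my ($text) = @_;
    return map { sprintf('U+%04X', ord($_)) } $text =~ /($XML10_ILLEGAL)/g;
}

print is_xml10_text("bar\tbaz\n") ? "ok\n" : "illegal\n";
print join(', ', xml10_offenders("bar\x01baz\x0C")), "\n";
```

The same test applies to the value a numeric character reference would expand to, which is why writing `&#1;` does not get around the restriction.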

      I don't know if CDATA removes this restriction. I'd think it would, but after being surprised by the control-character stupidity and seeing many XML near-experts also boggled by it, I won't speculate w/o reading the spec first.

      Of course, if you prefer the attribute-heavy style of XML, then CDATA won't be any help (I say w/o verifying this assumption but I'd nearly bet money on it).

      I feel PerlMonks' XML should be nearly or completely attribute-free. But that isn't much help since we already have a heavy base of ticker clients that don't handle CDATA.

      So when I said that control characters are a problem, I wasn't so XML-naive as to not have considered entities and CDATA.

      - tye        

        Ah. No, CDATA apparently doesn't help, and it would indeed be useless with attributes.

        And indeed, Genx refuses to put control characters in the output stream.

        However, I found that an AddText call with an empty string seems to consistently coerce Genx to flush whatever it had in buffer. At that point I can sneak anything I want into the stream before I resume business as usual:

        use XML::Genx;

        my $str = '';
        my $w = XML::Genx->new;
        eval {
            $w->StartDocSender( sub { $str .= shift; } );
            $w->StartElementLiteral( 'foo' );
            $w->AddText( 'bar' );
            $w->AddText( '' );    # empty text seems to force a flush
            $str .= chr 1;        # sneak the control character into the stream
            $w->AddText( 'baz' );
            $w->EndElement;
            $w->EndDocument;
        };
        die "Writing XML failed: $@" if $@;

        This works as expected and allows sending control characters in node content, though not in attributes.

        Another alternative, whose viability I can't judge, is that Genx provides a genxScrubText call which simply brushes out anything illegal. Would that be acceptable? (I do wonder why we need to make it possible to send control characters in the tickers.) However, the problem here is that XML::Genx currently doesn't bind that function.
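Until such a binding exists, the scrubbing could be approximated in plain Perl before the text ever reaches the writer. The sketch below is a hypothetical stand-in, not the Genx function: `scrub_xml10` is a made-up name, and it simply deletes every character outside the XML 1.0 `Char` production as I understand it.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a scrub step; NOT genxScrubText itself.
# Deletes every character outside the XML 1.0 Char production
# (tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF).
sub scrub_xml10 {
    my ($text) = @_;
    (my $clean = $text)
        =~ s/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $clean;
}

my $dirty = "bar\x01baz\x0B!";
my $clean = scrub_xml10($dirty);
printf "%d chars in, %d chars out: %s\n", length $dirty, length $clean, $clean;
```

Unlike the flush trick, this works for attribute values too, at the price of silently losing the control characters.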

        As for XML style, I agree that attributes should be avoided. I didn't understand this when I first learned of XML, but I've come to appreciate why it is common wisdom among more insightful people. Mixed content is also a pain when you're dealing with structured rather than “document-ish” data.

        Makeshifts last the longest.

        FYI, here's Tim Bray's explanation of the reasoning for that:

        The only characters that XML dislikes are ASCII C0 control characters such as form-feed, vertical-tab, and those wonderful things like EOT and DLE and NAK and SYN, which have exactly zero shared semantics from system to system; which is exactly why they're not in XML.

        Update: just to be clear, I am not supporting the argument — nor rejecting it. My only actual experience is limited to systems with very little variation: Unix vs Windows on the same hardware platform. I haven't even worked on MacOS X. So I don't know enough to make any argument here.

        Makeshifts last the longest.
