
is XML too hard?

by thraxil (Prior)
on Mar 17, 2003 at 18:06 UTC ( [id://243725] : perlmeditation )

Tim Bray, one of the inventors of XML, has an interesting weblog post exploring the idea that XML is too hard for programmers.

one of his main points is that programmers seem to be stuck either using an inefficient approach that parses the entire document and keeps it in memory (DOM), or writing code in an awkward callback style (XML::Parser, SAX, etc.) that doesn't mesh well with the programming language being used.

he also makes a good case for why having a language that is specifically designed to work with XML isn't as good an idea as it sounds.

the real bombshell of the piece is that he uses regexps to do most of his XML parsing. here at the monastery, whenever a young monk posts code that uses regexps to parse XML, we admonish them and dutifully point them at some of the more robust XML modules, for good reason. there is a world of difference between the inventor of XML, who has been writing Perl since 1993, using regexps for convenience, and a green newbie using regexps because they aren't aware of the gotchas or that better modules exist.

still, when the inventor of XML suggests that the existing approaches are too complicated, maybe we ought to pause for a moment to think about that.

for most of what i do with XML, i'm only dealing with small documents and performance isn't critical. in these situations, XML::Simple and XML::Twig make things easy and painless. i've only had to do callback or stream-based parsing a couple of times and, while i didn't find it that hard, i can see how it would be difficult to deal with in more complex applications.
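as a concrete illustration of that easy case, here is a minimal XML::Simple sketch (the config document and its field names are invented for the example):

```perl
use strict;
use warnings;
use XML::Simple;   # CPAN

# A small document of the kind XML::Simple handles painlessly.
my $xml = <<'XML';
<config>
  <server name="alpha" port="8080"/>
  <server name="beta"  port="8081"/>
</config>
XML

# XMLin hands back a plain Perl data structure -- no callbacks, no DOM.
# By default, repeated elements with a "name" attribute are folded into
# a hash keyed on that attribute.
my $conf = XMLin($xml);
for my $name (sort keys %{ $conf->{server} }) {
    print "$name listens on $conf->{server}{$name}{port}\n";
}
```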

i've been playing with HTML::TokeParser lately for scraping websites. i've found it to be a very intuitive and powerful interface. perhaps something similar could be done for XML. i see no reason that it couldn't be implemented with a stream-based backend keeping it efficient for parsing large documents.
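a sketch of what that pull style looks like with HTML::TokeParser itself (the page content here is made up):

```perl
use strict;
use warnings;
use HTML::TokeParser;   # from the HTML-Parser distribution

my $html = '<p><a href="/one">first</a> <a href="/two">second</a></p>';
my $p    = HTML::TokeParser->new(\$html);

# Pull mode: *we* ask for the next <a> tag when we are ready for it,
# instead of the parser calling us back.
while (my $tag = $p->get_tag('a')) {
    my $href = $tag->[1]{href};
    my $text = $p->get_trimmed_text('/a');
    print "$text => $href\n";
}
```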

what else can we think of to make working with XML less painful?

anders pearson

Replies are listed 'Best First'.
Re: is XML too hard?
by Steve_p (Priest) on Mar 17, 2003 at 19:58 UTC

    When working with XML, I haven't found that my programming style has changed at all; it's simply the methods (or functions) I use to process the file that change. I have seen many cases, however, where programmers make life much more difficult for themselves by using XML.

    For example, Java has a wonderful Properties API that allows you to store and use simple key=value pairs in a properties file. Rather than using this, however, I have seen a disturbing trend among Java programmers toward XML configuration files. The result is programmers pulling in an XML parser to read their config file and then never using it again in the program: a program with external dependencies that is larger and slower than it needs to be, to do something the base API already does easily. Java is not, unfortunately, the only place I've seen this happening. The rush to replace all file use with XML has been an unfortunate side effect of XML's success, and has led many programmers to use it in cases where other formats, including POFF (plain old flat files), would be sufficient.
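The same point holds in Perl: simple key=value configuration needs nothing beyond a few lines of core Perl and no parser dependency at all (the file contents and keys here are invented for illustration):

```perl
use strict;
use warnings;

# The POFF alternative to an XML config file: plain key=value lines.
my %config;
while (my $line = <DATA>) {
    chomp $line;
    next if $line =~ /^\s*(?:#|$)/;        # skip comments and blank lines
    my ($key, $value) = split /\s*=\s*/, $line, 2;
    $config{$key} = $value;
}
print "$config{host}:$config{port}\n";      # prints "db.example.com:5432"

__DATA__
# sample configuration
host = db.example.com
port = 5432
```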

Re: is XML too hard?
by perrin (Chancellor) on Mar 17, 2003 at 19:17 UTC
    It is so tempting to tell this guy "you made the bed, now f'ing sleep in it!" Why can't you use a simple line-oriented flat-file model on XML? Because they designed it to work differently, and he was in on the design!

    However, he doesn't make much of a case for the problem with using SAX. What's the real difference between using callbacks and writing a while (<STDIN>) loop? Not that much, in my opinion.

      What's the real difference between using callbacks and writing a while (<STDIN>) loop?

      I suspect the question was rhetorical but I'll bite anyway :-) Put simply, it's the difference between push and pull. If you're used to reading a file line-by-line, you're used to thinking in 'pull' mode - you tell the parser "give me the next thing". Using SAX or the XML::Parser handler style requires you to think in 'push' mode - the parser tells you when it has something interesting. (In Soviet Russia, the XML parser calls you!)
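      A minimal sketch of the contrast, using XML::Parser in push mode (the document is invented; the handlers just collect text from <item> elements):

```perl
use strict;
use warnings;
use XML::Parser;   # CPAN

my $xml = '<list><item>a</item><item>b</item></list>';

# Push mode: the parser drives.  We register handlers and give up
# control until the whole document has been processed.
my (@items, $in_item);
my $parser = XML::Parser->new(Handlers => {
    Start => sub { $in_item = 1 if $_[1] eq 'item' },
    Char  => sub { push @items, $_[1] if $in_item },
    End   => sub { $in_item = 0 if $_[1] eq 'item' },
});
$parser->parse($xml);

# Contrast with pull mode, where we stay in charge:
#   while (my $line = <STDIN>) { ... }
print "@items\n";   # prints "a b"
```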

      I think Matt and the team have done a great job with XML::SAX. Sub-classing XML::SAX::Base allows you to very easily write code which concentrates on the bits you're interested in. However, to use this stuff you have to think in a different way. Rather than saying "give me this and I'll deal with it", you have to say something more like "I'm interested in 'x', when you find an 'x', give it to this routine which knows how to deal with it".
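      A sketch of what that looks like in practice, subclassing XML::SAX::Base and overriding only the events we care about (the <title> element is an invented example):

```perl
package TitleFilter;
use strict;
use warnings;
use base 'XML::SAX::Base';

# "When you find a <title>, give it to this routine."
sub start_element {
    my ($self, $el) = @_;
    $self->{in_title} = 1 if $el->{Name} eq 'title';
    $self->SUPER::start_element($el);
}

sub characters {
    my ($self, $chars) = @_;
    print $chars->{Data}, "\n" if $self->{in_title};
    $self->SUPER::characters($chars);
}

sub end_element {
    my ($self, $el) = @_;
    $self->{in_title} = 0 if $el->{Name} eq 'title';
    $self->SUPER::end_element($el);
}

package main;
use XML::SAX::ParserFactory;

my $parser = XML::SAX::ParserFactory->parser(Handler => TitleFilter->new);
$parser->parse_string('<doc><title>Hello</title><body>etc.</body></doc>');
```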

      People are saying good things about using HTML::TokeParser to achieve a pull-style interface to XML. I haven't tried it myself yet - my needs are simple :-)

Are you looking at XML processing the right way?
by dragonchild (Archbishop) on Mar 17, 2003 at 21:07 UTC
    Thus we'd be forced to use parser callbacks of one kind or another, which is sufficiently non-idiomatic and awkward ...

    Personally, I'd like to see this comment explained more. "Non-idiomatic"? I believe that tye and tilly (among many others) have made the case that callbacks are not "non-idiomatic" in Perl. Perl is as functional as it is procedural (and a heck of a lot more of both than it is OO). I would posit that the venerable Mr. Bray is less comfortable with functional logic than he is with XML. *

    Writing in callbacks is counter-intuitive, yes. But, only if you're attempting to shoe-horn callbacks into a procedural or OO model. If you're not, then I would say that callbacks are extremely intuitive. I certainly have found them so and I have only encountered callbacks and lambda-functions in Perl (never having programmed in LISP, Haskell, Scheme, or the like).

    So I think the key first step is to make XML stream processing idiomatic in as many programming languages as possible.

    What would Mr. Bray consider stream processing to be, other than callbacks? Another response to this node says that while (<STDIN>) { ... } is interchangeable with callbacks, and I agree completely.

    Are the callback interfaces out there complete? Maybe not. But, equating that with the statement that callback interfaces are "non-idiomatic and awkward" is a logical fallacy.

    And, while I'm up on this soapbox, I would say that XML is not a panacea. It does lend itself to marking up data. Heck, that's all a "markup language" is meant to do, text being data. But it most definitely is not a replacement for an RDBMS, and I find it used as such way too often. Personally, I find it wonderful as a layout specifier in PDF::Template. It's also a really neat data transport mechanism, having the qualities that it's flat text, in an agreed-upon format, with parsers and writers readily available.

    (* - Yes, I understand that speaking ill of a "Great One" is poor form, but I find little benefit in assuming that someone has more skill than they may have.)

    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

      Try to do a merge sort on two streams that give you records via callbacks. It is just plain impossible. Callbacks are superficially/psychologically different from iterators, true. But they are also fundamentally less flexible. So your comparison only applies in the simple cases. In the complex cases, you find your hands tied and your work much more difficult.

      But there is no reason why XML modules can't be cast as iterators instead of forcing you to use callbacks. That is just too-lazy/not-smart-enough module design (it just takes a bit more work, and knowing enough to realize that it is possible and can be important). With an iterator-based XML module, things will be better.
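      One way to get an iterator out of a callback-only parser is to buffer the events and hand them out on demand. This toy sketch is not truly streaming (it parses everything up front; a real implementation would parse incrementally), but it shows the shape of the interface being described:

```perl
use strict;
use warnings;
use XML::Parser;   # CPAN

# Wrap XML::Parser's callbacks in a pull interface: each call to the
# returned closure yields the next event, or undef at end of input.
sub make_iterator {
    my ($xml) = @_;
    my @events;
    XML::Parser->new(Handlers => {
        Start => sub { push @events, [ start => $_[1] ] },
        Char  => sub { push @events, [ text  => $_[1] ] },
        End   => sub { push @events, [ end   => $_[1] ] },
    })->parse($xml);
    return sub { shift @events };
}

my $next = make_iterator('<a><b>hi</b></a>');
while (my $ev = $next->()) {
    print "@$ev\n";
}
```

With two such iterators in hand, merging two streams becomes an ordinary two-pointer loop, which is exactly the flexibility the pure callback style takes away.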

      But that would still force a linear approach which isn't as flexible as the (often inefficient) data structure approach. Providing ways to 'seek' your XML iterators can help with some cases but not others.

      You can also come from the other end of the spectrum and make the data structure versions more efficient by having them compute things only as they are needed, which is also one step on the road to not keeping everything in RAM at once. On-demand data structures for XML are probably about the closest you can come to "the best of both worlds". Of course, the complexity involved means that it will never be as efficient/convenient as a much simpler approach that fits what needs to be done. But that difference in efficiency is not something that I find worth worrying about (though I do prefer having choices for the sake of convenience).

      It appears that XML::Twig tries to cover several of these points on the spectrum. I parse XML with regular expressions[1] so I've never used it, but I hear good things about it. (:

                      - tye

      [1] And I suspect many who say you shouldn't parse XML with a regex don't know how to do it right. For example, "ways to rome" just does it wrong. Of course you shouldn't do it that way!

        For example, "ways to rome" just does it wrong. Of course you shouldn't do it that way!

        By all means, show us an alternate way. No kidding: if the article can show a safer way to do it with regexps, then it should.

        As a matter of fact I use regexps to process XML in very specific cases. First, when I am processing "not-quite-XML", which would make a parser choke, I use regexps to turn it into real XML. Second, when I need to do things that XML modules (even XML::Twig!) can't do AND when I have created the XML myself, so that it uses no entities/comments/weird stuff at all, has no recursive tags, and I know the order of attributes in tags. Then I use regexps (or I upgrade XML::Twig, see the new wrap_children method ;--)

        The problem is people who don't know XML beyond "it's HTML with my own tags" (and very few people actually know HTML that well), who use regexps for recurring processes where the likelihood of the XML changing in the future in ways that will break their code is quite high (<pet_peeve>mind you, the DOM has very similar issues</pet_peeve>). It is not a problem of regexps=bad; it's a problem of knowing your tools, knowing the problem space (and there are indications that Tim Bray knows the problem space ;--), and knowing the limitations of the tools in the context in which you are using them. When you don't quite know the environment, it is better to play it safe and use a parser rather than regexps.
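        A sketch of that "not-quite-XML" cleanup case, using only core Perl (the broken input and its exact defects are invented; such rewrites are only safe when you know the generator's output format precisely):

```perl
use strict;
use warnings;

# Hypothetical almost-XML: an unquoted attribute and a bare ampersand,
# either of which would make a real XML parser choke.
my $almost_xml = '<item id=42>Fish & Chips</item>';

# Quote unquoted attribute values.
$almost_xml =~ s/(\w+)=(\w+)/$1="$2"/g;

# Escape ampersands that do not already start an entity.
$almost_xml =~ s/&(?!\w+;|#\d+;)/&amp;/g;

print $almost_xml, "\n";   # <item id="42">Fish &amp; Chips</item>
```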


        From a database perspective, XML is a species of hierarchical database. If you need to look at it from any angle other than the tree structure expressed in the nesting of the tags, you have a tedious problem. I'm not making a value judgement, per se, but you do have to keep this in mind. If you are trying to force a relational model into an XML format, you should expect "interesting" times.

        I'm curious how one does a merge sort on two XML streams. If the structure can be parsed as a linear series of records, you have something to merge. It all depends on the trees. If you have a bunch of sticks without branches, you may have something. If you have a bunch of heavily branched shrubbery, you have a hard problem, unless you are taking paths from the root to each leaf as a record.

        The "looks like an iterator" approach sounds interesting.

        Now, I'm speaking in the general case. If the particular XML you work with is more tightly constrained, you can take advantage of that to better structure your code.

        I've used XML::Twig for some parsing I do. I only need a subset of the data; it gives it to me without too much hassle. On the other hand, it did force me to syntax my invert a bit. :)


        Can you give me an example of two streams that would have a merge sort needed that would be impossible? (I'm still relatively new to XML processing, but it would seem that merge-sorting would be very simple with N streams using callbacks, at least theoretically...)

        We are the carpenters and bricklayers of the Information Age.

        Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

        Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: is XML too hard?
by Abigail-II (Bishop) on Mar 17, 2003 at 23:53 UTC
    XML too hard? It's just LISP, isn't it? Except with more typing.

    Programmers grokked LISP 50 years ago. If XML is too hard, it says more about the quality of the programmers than of XML.


      In that respect, XML documents should/need to be converted to Lisp programs. Then all one would need is a Perl/Lisp interface, where Lisp would just hand Perl whatever you ask for.
      I don't know Lisp, but this analogy seems not quite correct to me. For instance, how about this XML:
      <REALLYLONGNAME with_a_property='aaa'> some free form text <ANOTHERLONGNAME>foo</ANOTHERLONGNAME> text text</REALLYLONGNAME>
      How would you translate this to lisp?
        (REALLYLONGNAME ((with_a_property "aaa")) ("some free form text" (ANOTHERLONGNAME () ("foo")) "text text"))


Re: is XML too hard?
by Solo (Deacon) on Mar 17, 2003 at 21:02 UTC
    Is this a wishlist question? Whoo-hoo!

    I wish...

    1. to never think about encoding again
    2. default namespaces would Do What I Mean
    3. for a DBD::XML::File, or somesuch, that works a little like DBD::CSV. Specify a directory for the database, a file (or files) for the table and treat XQL and XPath like SQL statements.
    4. for a DBD::XML::URI, or somesuch, that works a little like DBD::XML::File but has a nifty URI mechanism.


      Who's scruffy lookin'?
Re: is XML too hard?
by zby (Vicar) on Mar 17, 2003 at 20:14 UTC
    The problem with XML is that people use it as a data structure language, while it is just a markup language. The use of XML as a data structure language is a bit artificial - you can structure things with markup, but it's not the most efficient way (and you are confined to trees). Why not choose a subset of a programming language - they were designed to be efficient at describing data structures in a human-readable format.

      And of course, "datastructure" is data plus context. That's exactly what XML provides. It is a "datastructure" format.

        It is. I don't say that it is not. I just say it is not a good one. By a good one I mean one that would be efficient at describing the data structures we encounter most frequently in our programming practice - a general one. XML is well suited to describing structure in text documents, not general programming data structures.

        The issue is quite subtle - but look at XML::Simple. It tries to build a tree data structure from an XML file, but to do so it needs to make so many guesses that for quite uncomplicated structures you get anomalies - for instance, when you save the structure built from XML you can get back entirely different XML.
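        The anomaly is easy to demonstrate (a sketch using XML::Simple's defaults, with no ForceArray or KeyAttr options pinned down):

```perl
use strict;
use warnings;
use XML::Simple;

# One <item> vs. two: XML::Simple guesses a different structure.
my $one  = XMLin('<list><item>a</item></list>');
my $many = XMLin('<list><item>a</item><item>b</item></list>');

# $one->{item}  is the string "a"
# $many->{item} is the array ref ["a", "b"]
```

Worse, round-tripping $one through XMLout emits the value as an attribute rather than the original child element, so you really can save a structure built from XML and get back entirely different XML.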

Re: is XML too hard?
by data64 (Chaplain) on Mar 17, 2003 at 22:36 UTC

    If you are looking for some way of querying the XML document and getting results in a procedural fashion rather than using callbacks, then XPath is the way to go. XML::XPath::Simple is one of the few modules that supports this approach.
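    A sketch of that query style with XML::XPath (the document is invented):

```perl
use strict;
use warnings;
use XML::XPath;   # CPAN

my $xp = XML::XPath->new(
    xml => '<doc><para id="p1">one</para><para id="p2">two</para></doc>'
);

# Procedural, query-style access: no callbacks, just ask for node sets.
for my $node ($xp->findnodes('//para[@id="p2"]')) {
    print $node->string_value, "\n";   # prints "two"
}
```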

    Just a tongue-tied, twisted, earth-bound misfit. -- Pink Floyd

Re: is XML too hard?
by miktro (Initiate) on Mar 18, 2003 at 15:49 UTC

    I have been using XML for about 3 years now.

    What I have found important to remember is that XML is a way of tagging the INFORMATION content of data / text.
    It is not primarily a DATA storage / representation format.

    However efficient or poor the XML handling software may be, if the data isn't organised for what you need to do, the program will be slow, large or convoluted.

    I quite often find that I need to restructure the XML into an organisation that is relevant and suitable for the processing I need to do.
    I generally use XSLT for this - which I find robust and clear for this step - and nearly always do some XSLT / SAX pre-processing.

    When the data matches the application then using XML + Xpath or SAX is usually a very concise way of achieving the desired result.

    There are also cases where I convert the XML file into another format entirely and use a non-XML based approach.

    I am probably fortunate, as most of the times I need to use XML it is for data which is deeply hierarchical, where context matters and XPath is a very natural way of describing data clusters and relationships.

Re: is XML too hard?
by Aristotle (Chancellor) on Mar 17, 2003 at 22:49 UTC
    Time for someone to write XML::TokeParser I guess.. as if there weren't enough XML modules and parser styles/models already. :)

    Makeshifts last the longest.

      I suspect you already know this, but XML::TokeParser does exist, at least on BackPAN. I've contacted the author to inquire as to why he took it off of CPAN, but I haven't heard back from him. I wanna take over maintenance of it (I will just do it if I don't hear from the author any time soon).

      MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
      I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
      ** The Third rule of perl club is a statement of fact: pod is sexy.

Re: is XML too hard?
by gmpassos (Priest) on Mar 18, 2003 at 07:20 UTC
    XML is just a way to organize data. And it gets sold to us (by IBM and others) as a universal format that will work for everything.

    Well, I respect XML, but I don't think it is that useful, since it wasn't kept simple (KISS). To be universal, a format needs to be simple. This is why we speak English in other countries, not Latin or Esperanto (yes, the synthetic, dead language). In other words, XML is like Esperanto. Very much like it, in fact: Esperanto was created from something that was already there and famous, Latin, and XML was created from HTML. But for everyone to be able to use it easily, Esperanto needed to be simple, and it wasn't - and the same goes for XML. The problem with XML is that it started from the wrong point.

    But why is XML hard? Well, XML is very easy to write, to produce the document, just as HTML is. But it is hard to read, to get at the data inside. Just imagine having to write an HTML parser for a browser: HTML is not a good format for extracting data.

    I think a format that is simple to extract data from, and that allows a tree structure (when one is needed), would be better. It is just a matter of building it, and it is something that is needed.

    I remember seeing, only last year (XML had been around a long time by then), an article titled "Finally I found a use for XML". In other words, XML as it stands is not very useful; we still prefer to use our own formats or make new ones.

    Graciliano M. P.
    "The creativity is the expression of the liberty".

      Actually, natural languages are governed by two principles: economy and expressivity. If we anthropomorphize a bit (which is fairly safe for human products like language), the point is not necessarily to keep things simple, but to provide a means of expressing what you want to communicate while not having things so complex that too much processing power is lost while trying to decipher what was said.

      All languages arrive at a relative balance of expressivity and economy, but their systems are by no means stable. New ideas come up and people need to express them; physical conditions change (e.g. parts of words not being pronounced, producing ambiguity) and people have to readapt so that things are clear again. A lot of the processing involved in doing this is done in the background by a system resulting from a combination of innate ability and repetitive conditioning. It's not just linguistic experience that counts - our minds also have to make sense out of things said according to the context they are said in. Think about a phrase like "Would you like to come up for a cup of coffee?" in the context of a date.

      Put into a nutshell, artificial languages are governed by the same principles, but they are usually an attempt to get the 'best' of both. The main thing is that the system should be easy to describe so that it can be learned quickly. Programming languages are pragmatic. They are all about getting things done. Markup languages are designed to add value to previously existing information (e.g. clarity/removing ambiguity).

      XML markup is just a way of adding meaning to text. It definitely fills the criterion of being easy to describe. All variation in the system of describing content is regular - it has to be, or else XML would not work. But this ease of description comes at the usual price. Since XML does not have the contextual and cultural cues that meaning in human languages has, it is forced to be very explicit. That's what causes all the headaches, but it is at the same time the genius of the system. Processing it must be exhaustive, but you only have to process it in one way. Imagine if you had to add contextual and cultural cues to your markup.

      But XML is also going in other directions that resemble human semantic processing a lot more. Topic maps, for example, can provide and maintain contextual and metalinguistic information (and a whole lot of other stuff). In order to do this, however, the constructs we use must become more complex. XML is actually something simple which provides a framework for doing more complex things. In any case, pretty much everyone who works with XML is only scratching the surface of what can be done with it. We're dealing with a subset.

      So what can we do to make working with XML easier? XML allows us to do whole bunch of things, but they'll turn out to be use-impaired if we don't plan things correctly. So let's get our ducks in a row before we start adding "value". Many of the applications I have seen for XML were clearly inspired by the desire to use new technology without really considering its potential benefits.


      In fact, Esperanto is claimed to be really simple, much simpler than any natural language. I believe it was based on French rather than Latin.
        And German, English, several eastern European languages - and a bad idea, IMHO.

        It's like saying: Hey lets make a language out of Ruby, Python, Awk, Perl3, Perl4 and a little bit Perl5, throw in a good measure of BASIC, a bit Lisp, stir and puke. Oh you don't like it? Well, it's not going to change, you know. That's that. And good riddance.


      The problem with creating an artificial language based on another one is choosing something that is not very good for the objective, that's all.

      As for Esperanto being simple - well, English is more so! My main language is Portuguese, my first language was French, my third English, and I'm learning Spanish. I'm just saying what I have seen.

      Graciliano M. P.
      "The creativity is the expression of the liberty".

Re: is XML too hard?
by crenz (Priest) on Mar 18, 2003 at 12:51 UTC

    That's why I prefer YAML for data storage. No need for all the clutter XML comes with...