Are you looking at XML processing the right way?

by dragonchild (Archbishop)
on Mar 17, 2003 at 21:07 UTC (#243800)

in reply to is XML too hard?

Thus we'd be forced to use parser callbacks of one kind or another, which is sufficiently non-idiomatic and awkward ...

Personally, I'd like to see this comment explained more. "non-idiomatic"? I believe that tye and tilly (among many others) have made the case that callbacks are not "non-idiomatic" in Perl. Perl is as functional as it is procedural (and a heck of a lot more of both than it is OO). I would posit that the venerable Mr. Bray is less comfortable with functional logic than he is with XML. *

Writing in callbacks is counter-intuitive, yes. But, only if you're attempting to shoe-horn callbacks into a procedural or OO model. If you're not, then I would say that callbacks are extremely intuitive. I certainly have found them so and I have only encountered callbacks and lambda-functions in Perl (never having programmed in LISP, Haskell, Scheme, or the like).

So I think the key first step is to make XML stream processing idiomatic in as many programming languages as possible.

What would Mr. Bray consider stream processing to be, other than callbacks? Another response to this node says that while (<STDIN>) { ... } is interchangeable with callbacks, and I agree completely.

Are the callback interfaces out there complete? Maybe not. But, equating that with the statement that callback interfaces are "non-idiomatic and awkward" is a logical fallacy.

And, while I'm up on this soapbox, I would say that XML is not a panacea. It does lend itself to marking up data. Heck, that's all a "markup language" is meant to do, text being data. But, it most definitely is not a replacement for an RDBMS, and I find it used as such way too often. Personally, I find it wonderful as a layout specifier in PDF::Template. It's also a really neat data transport mechanism, having the qualities that it's flat text and in an agreed-upon format with parsers and writers readily available.

(* - Yes, I understand that speaking ill of a "Great One" is poor form, but I find little benefit in assuming that someone has more skill than they may have.)

We are the carpenters and bricklayers of the Information Age.

Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Are you looking at XML processing the right way? (merge)
by tye (Cardinal) on Mar 17, 2003 at 23:03 UTC

    Try to do a merge sort on two streams that give you records via callbacks. It is just plain impossible. Callbacks are superficially/psychologically different from iterators, true. But they are also fundamentally less flexible. So your comparison only applies in the simple cases. In the complex cases, you find your hands tied and your work much more difficult.

    But, there is no reason why XML modules can't be cast as iterators instead of forcing you to use callbacks. That is just too-lazy/not-smart-enough module design (it just takes a bit more work and knowing enough to realize that it is possible and can be important). With an iterator-based XML module, things will be better.
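
A minimal sketch of what an iterator-style interface could look like, assuming a made-up helper named make_record_iterator and a flat input of <rec> elements. The regex is only a stand-in for a real parser and is not how any particular XML module works; the point is that the caller pulls records when it wants them.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): returns a closure that yields
# one record per call, or undef once the input is exhausted. The regex only
# copes with flat, non-nested <rec> elements.
sub make_record_iterator {
    my( $xml )= @_;
    return sub {
        # m//g in scalar context resumes from the previous match position.
        return $xml =~ m{<rec>(.*?)</rec>}gs ? $1 : undef;
    };
}

my $next= make_record_iterator( '<log><rec>a</rec><rec>b</rec></log>' );
my @seen;
while( defined( my $rec= $next->() ) ) {
    push @seen, $rec;    # the caller decides when to pull the next record
}
print "@seen\n";
```

Because control returns to the caller after each record, nothing stops the caller from holding several such iterators at once.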

    But that would still force a linear approach which isn't as flexible as the (often inefficient) data structure approach. Providing ways to 'seek' your XML iterators can help with some cases but not others.

    You can also come from the other end of the spectrum and make the data structure versions more efficient by having them compute things only as they are needed, which is also one step on the road to being able to not keep everything in RAM at once. On-demand data structures for XML is probably about the closest you can come to "the best of both worlds". Of course, the complexity involved means that it will never be as efficient/convenient when a much simpler approach fits what needs to be done. But that difference in efficiency is not something that I find worth worrying about (though I do prefer having choices for the sake of convenience).
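
One way to picture "compute things only as they are needed" is a node that keeps its raw text and splits it into children on first access. This is only a sketch under assumptions: lazy_node and the <item> layout are invented for illustration, and the regex again stands in for real parsing.

```perl
use strict;
use warnings;

my $parse_count= 0;

# Hypothetical lazy node: keeps the raw text and splits it into child
# records only on first access, caching the result afterwards.
sub lazy_node {
    my( $raw )= @_;
    my $children;
    return {
        raw      => $raw,
        children => sub {
            if( ! $children ) {
                $parse_count++;
                $children= [ $raw =~ m{<item>(.*?)</item>}gs ];
            }
            return $children;
        },
    };
}

my $node= lazy_node( '<item>x</item><item>y</item>' );
# Nothing has been parsed yet; the work happens on the first call only.
my $first=  $node->{children}->();
my $second= $node->{children}->();
print scalar(@$first), " children, parsed $parse_count time(s)\n";
```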

    It appears that XML::Twig tries to be several of these points on the spectrum. I parse XML with regular expressions [1] so I've never used it, but I hear good things about it. (:

                    - tye

    [1] And I suspect many that say you shouldn't parse XML with a regex don't know how to do it right. For example, "ways to rome" just does it wrong. Of course you shouldn't do it that way!

      For example, "ways to rome" just does it wrong. Of course you shouldn't do it that way!

      By any means, send an alternate way. No kidding. If the article can show a safer way to do it with regexps, then it should.

      As a matter of fact I use regexps to process XML in very specific cases. First, when I am processing "not-quite-XML", which would make a parser choke, I use regexps to turn it into real XML. Then when I need to do things that XML modules (even XML::Twig!) can't do AND when I have created the XML myself. That is, it uses no entities/comment/weird stuff at all, it has no recursive tags and I know the order of attributes in tags. Then I use regexps (or I upgrade XML::Twig, see the new wrap_children method ;--)
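
A sketch of the narrow case described above, where regexps can be defensible. Everything here is an assumption made for illustration: the <msg> element, the attribute names and order, and the sample lines are invented, and the code deliberately refuses anything that does not match the expected shape.

```perl
use strict;
use warnings;

# Assumed input: XML we generated ourselves, one flat <msg> per line,
# attributes always in the order author, time; no entities, comments,
# CDATA, or nested <msg> elements. Outside those limits, use a parser.
my @lines= (
    '<msg author="alice" time="20030317230300">first line</msg>',
    '<msg author="bob" time="20030318090000">second line</msg>',
);

my @msgs;
for my $line ( @lines ) {
    if( $line =~ m{^<msg author="([^"]*)" time="(\d+)">([^<]*)</msg>$} ) {
        push @msgs, { author => $1, time => $2, text => $3 };
    } else {
        warn "Unexpected shape, refusing to guess: $line\n";
    }
}
print "$_->{time} $_->{author}: $_->{text}\n"  for  @msgs;
```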

      The problem is people who don't know XML besides "it's HTML with my own tags" (and very few people actually know HTML that well), who use regexps for recurring processes, where the likelihood of the XML changing in the future in ways that will break their code is quite high (<pet_peeve>mind you, the DOM has very similar issues</pet_peeve>). It is not a problem of regexps=bad, it's just a problem of knowing your tools, knowing the problem space (and there are indications that Tim Bray knows the problem space ;--) and knowing the limitations of the tools in the context in which you are using them. When you don't quite know the environment, better to play it safe and to use a parser than a regexp.


      From a database perspective, XML is a species of hierarchical database. If you need to look at it from any angle other than the tree structure expressed in the nesting of the tags, you have a tedious problem. I'm not making a value judgement, per se, but you do have to keep this in mind. If you are trying to force a relational model into an XML format, you should expect "interesting" times.

      I'm curious how one does a merge sort on two XML streams. If the structure can be parsed as a linear series of records, you have something to merge. It all depends on the trees. If you have a bunch of sticks without branches, you may have something. If you have a bunch of heavily branched shrubbery, you have a hard problem, unless you are taking paths from the root to each leaf as a record.
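
To illustrate "paths from the root to each leaf as a record", here is a small sketch that flattens a nested hash into one record per leaf. The nested hash is only a stand-in for a parsed XML tree, and leaf_paths is a name made up for this example.

```perl
use strict;
use warnings;

# Flatten a nested hash (a stand-in for a parsed XML tree) into one
# "path=value" record per leaf. Keys are visited in sorted order.
sub leaf_paths {
    my( $node, $path )= @_;
    return "$path=$node"  if  ! ref $node;
    return map { leaf_paths( $node->{$_}, "$path/$_" ) } sort keys %$node;
}

my $tree= { doc => { head => { title => 'T' }, body => { p => 'text' } } };
my @records= leaf_paths( $tree, '' );
print "$_\n"  for  @records;
```

Once the tree is reduced to a sorted list of such records, it becomes something that can be merged with another list of the same shape.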

      The "looks like an iterator" approach sounds interesting.

      Now, I'm speaking in the general case. If the particular XML you work with is more tightly constrained, you can take advantage of that to better structure your code.

      I've used XML::Twig for some parsing I do. I only need a subset of the data; it gives it to me without too much hassle. On the other hand, it did force me to syntax my invert a bit. :)


      Can you give me an example of two streams that would have a merge sort needed that would be impossible? (I'm still relatively new to XML processing, but it would seem that merge-sorting would be very simple with N streams using callbacks, at least theoretically...)


        The point about "merge sort" is about callbacks versus iterators not about XML. Compare:

        # Code using iterator:
        while( <INPUT> ) {
            process_line( $_ );
        }

        # Code using call-back:
        File::ProcessLines( \*INPUT, sub { process_line($_) } );

        and you see that the differences appear rather superficial and psychological.

        but it would seem that merge-sorting would be very simple with N streams using callbacks

        No, it is impossible. Using callbacks means that you have to completely process one stream before you get control back to process another stream. You can't process 2 streams at once with callbacks, much less N streams.

        Consider this code:

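        # A merge sort (a record found in both streams is printed once):
        my $r1= <$i1>;
        my $r2= <$i2>;
        # Test the records, not eof(), so the last record read is not skipped.
        while( defined $r1 && defined $r2 ) {
            my $cmp= $r1 cmp $r2;
            print $cmp <= 0 ? $r1 : $r2;
            $r1= <$i1>  if  $cmp <= 0;
            $r2= <$i2>  if  0 <= $cmp;
        }
        # ...
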
        Now rewrite the above using File::ProcessLines and callbacks. You can't. It is impossible. To do it requires continuations which Perl doesn't have. Let's try:

        File::ProcessLines( $i1, sub { # sub1
            my $r1= shift(@_);
            File::ProcessLines( $i2, sub { # sub2
                my $r2= shift(@_);
                if( $r1 lt $r2 ) {
                    return_from_sub1_but_not_from_sub2;
                    # ...
                }
            } );
        } );

        So we can go as far as getting the first record of each of the two streams. But to get the second record of the first stream requires us to return from the first callback, which won't happen until the entire second stream is processed.

        The point is that callbacks are not just harder to use, they are also fundamentally less flexible. They require processing be done in an extremely restricted linear order and make it very unnatural to even share state between the callbacks.

        Iterators are more flexible. Iterators that can seek are even more flexible. A random access data structure is still more flexible.
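
For contrast, here is a sketch of the same merge written against iterators that are plain closures, assuming each iterator returns the next record or undef when it runs dry. The helper names iter_from_list and merge_iters are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a sorted list in an iterator closure.
sub iter_from_list {
    my @items= @_;
    return sub { return shift @items };
}

# Merge two sorted iterators; a record present in both is emitted once.
sub merge_iters {
    my( $it1, $it2 )= @_;
    my $r1= $it1->();
    my $r2= $it2->();
    my @out;
    while( defined $r1 && defined $r2 ) {
        my $cmp= $r1 cmp $r2;
        push @out, $cmp <= 0 ? $r1 : $r2;
        $r1= $it1->()  if  $cmp <= 0;
        $r2= $it2->()  if  0 <= $cmp;
    }
    # One stream is exhausted; drain whatever remains of the other.
    while( defined $r1 ) { push @out, $r1; $r1= $it1->(); }
    while( defined $r2 ) { push @out, $r2; $r2= $it2->(); }
    return @out;
}

my @merged= merge_iters(
    iter_from_list(qw( a c e )),
    iter_from_list(qw( b c d f )),
);
print "@merged\n";
```

The loop body decides, record by record, which stream to advance. That choice is exactly what a callback interface takes away.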

        Now, for an XML example. Assume there is some web site that discusses Perl. Assume also that this site has a chatterbox and you can get the last 10 lines of chatter from an XML ticker. You fetch chatter, wait a while and fetch it again. Now you want to combine those two. You could certainly use a merge sort for that. But that is impossible using callbacks.

        There are other ways you could merge such data. In this case, the data is only 10 lines so not being able to use merge sort isn't a huge problem. Also, the data should only overlap in one chunk, so you could even use callbacks to do this merging but it would be much more difficult than if you used a more flexible interface (and it would be impossible to make it deal well with some exceptions).

        Let's also assume that there are several people who have written their own chatter archiving systems. But each system has periods of down time for various reasons. Now you want to combine these archives to get as complete an archive as possible. They've each stored their data in different formats, of course. The obvious solution is to have each site send you their data in XML; it is nearly the canonical example for what XML is useful for. Now you have a case where a merge sort is important. But your XML parser only supports callbacks. So you are forced to convert each stream into something other than XML and then merge the new streams. What a waste.
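
If each archive could be read through an iterator, combining any number of them would be a short loop. The sketch below assumes every iterator yields records already sorted by a comparable key (for example a timestamp prefix) and returns undef at the end; merge_n and iter_from_list are names invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a sorted list in an iterator closure.
sub iter_from_list {
    my @items= @_;
    return sub { return shift @items };
}

# Merge any number of sorted iterators, dropping records that repeat.
sub merge_n {
    my @its= @_;
    my @head= map { scalar $_->() } @its;   # current record of each stream
    my @out;
    while( 1 ) {
        my( $min )= sort( grep { defined $_ } @head );
        last  if  ! defined $min;
        push @out, $min;
        # Advance every stream whose current record equals the one emitted.
        for my $i ( 0 .. $#its ) {
            $head[$i]= $its[$i]->()
                if  defined $head[$i]  &&  $head[$i] eq $min;
        }
    }
    return @out;
}

my @all= merge_n(
    iter_from_list(qw( a d )),
    iter_from_list(qw( b d e )),
    iter_from_list(qw( c )),
);
print "@all\n";
```

Re-sorting the heads on every pass is wasteful for many streams; a heap would be the usual refinement, left out here to keep the sketch short.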

        Note that callbacks can also be used where the first call does not process the entire stream before returning. This gives you a bit of a combination between callbacks and iterators (you 'iterate' to the next chunk which causes one or more of your callbacks to be called). So you can iterate whatever the "chunks" are but are forced to process each chunk using callbacks.
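
A sketch of that hybrid, assuming a made-up make_chunk_stepper that processes one chunk per call by firing a callback for each record in it. The chunk data is invented; no existing module is being described.

```perl
use strict;
use warnings;

# Hypothetical hybrid interface: each call to the returned closure handles
# exactly one chunk by invoking the supplied callback per record, then hands
# control back to the caller. It returns false once the input is exhausted.
sub make_chunk_stepper {
    my( $chunks, $on_record )= @_;
    my $i= 0;
    return sub {
        return 0  if  $i >= @$chunks;
        $on_record->( $_ )  for  @{ $chunks->[$i++] };
        return 1;
    };
}

my( @a, @b );
my $step_a= make_chunk_stepper( [ [1,2], [5] ], sub { push @a, shift } );
my $step_b= make_chunk_stepper( [ [3], [4,6] ], sub { push @b, shift } );

# Because control returns after every chunk, two streams can be interleaved.
while( 1 ) {
    my $more_a= $step_a->();
    my $more_b= $step_b->();
    last  unless  $more_a || $more_b;
}
print "a=@a b=@b\n";
```

The caller controls the order of chunks, but within a chunk the records still arrive through the callback in a fixed order.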

        In summary: Yes, callbacks are fundamentally one of the least flexible interfaces you can provide. They make it easy for the module writer to provide the interface and make it hard for the module user to use the interface. And it is not just a matter of "getting used to" using callbacks.

                        - tye
