Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Re2: Are you looking at XML processing the right way? (merge)

by dragonchild (Archbishop)
on Mar 18, 2003 at 15:28 UTC ( [id://244026]=note: print w/replies, xml ) Need Help??


in reply to Re: Are you looking at XML processing the right way? (merge)
in thread is XML too hard?

Can you give me an example of two streams that would have a merge sort needed that would be impossible? (I'm still relatively new to XML processing, but it would seem that merge-sorting would be very simple with N streams using callbacks, at least theoretically...)

------
We are the carpenters and bricklayers of the Information Age.

Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

  • Comment on Re2: Are you looking at XML processing the right way? (merge)

Replies are listed 'Best First'.
Re^3: Are you looking at XML processing the right way? (merge)
by tye (Sage) on Mar 18, 2003 at 18:02 UTC

    The point about "merge sort" is about callbacks versus iterators not about XML. Compare:

    # Code using iterator: while( <INPUT> ) { process_line( $_ ); } # Code using call-back: File::ProcessLines( \*INPUT, sub { process_line($_) } );
    and you see that the differences appear rather superficial and psychological.

    but it would seem that merge-sorting would be very simple with N streams using callbacks

    No, it is impossible. Using callbacks means that you have to completely process one stream before you get control back to process another stream. You can't process 2 streams at once with callbacks, much less N streams.

    Consider this code:

    # A merge sort: my $r1= <$i1>; my $r2= <$i2>; while( ! eof($i1) && ! eof($i2) ) { my $cmp= $r1 cmp $r2; print $cmp le 0 ? $r1 : $r2; $r1= <$i1> if $cmp le 0; $r2= <$i2> if 0 le $cmp; } # ...
    Now rewrite the above using File::ProcessLines and callbacks. You can't. It is impossible. To do it requires continuations which Perl doesn't have. Let's try:
    File::ProcessLines( $i1, sub { # sub1 my $r1= shift(@_); File::ProcessLines( $i2, sub { # sub2 my $r2= shift(@_); if( $r1 lt $r2 ) { return_from_sub1_but_not_from_sub2; # ... } ); } );
    So we can go as far as getting the first two records of each stream. But to get the second record of the first stream requires us to return from the first callback which won't happen until the entire second stream is processed.

    The point is that callbacks are not just harder to use, they are also fundamentally less flexible. They require processing be done in an extremely restricted linear order and make it very unnatural to even share state between the callbacks.

    Iterators are more flexible. Iterators that can seek are even more flexible. A random access data structure is still more flexible.

    Now, for an XML example. Assume there is some web site that discusses Perl. Assume also that this site has a chatterbox and you can get the last 10 lines of chatter from a XML ticker. You fetch chatter, wait a while and fetch it again. Now you want to combine those two. You could certainly use a merge sort for that. But that is impossible using callbacks.

    There are other ways you could merge such data. In this case, the data is only 10 lines so not being able to use merge sort isn't a huge problem. Also, the data should only overlap in one chunk, so you could even use callbacks to do this merging but it would be much more difficult than if you used a more flexible interface (and it would be impossible to make it deal well with some exceptions).

    Let's also assume that there are several people who have written their own chatter archiving systems. But each system has periods of down time for various reasons. Now you want to combine these archives to get as complete an archive as possible. They've each stored their data in different formats, of course. The obvious solution is to have each site send you their data in XML; it is nearly the canonical example for what XML is useful for. Now you have a case where a merge sort is important. But your XML parser only supports callbacks. So you are forced to convert each stream into something other than XML and then merge the new streams. What a waste.

    Note that callbacks can also be used where the first call does not process the entire stream before returning. This gives you a bit of a combination between callbacks and iterators (you 'iterate' to the next chunk which causes one or more of your callbacks to be called). So you can iterate whatever the "chunks" are but are forced to process each chunk using callbacks.

    In summary: Yes, callbacks are fundamentally one of the least flexible interfaces you can provide. They make it easy for the module writer to provide the interface and make it hard for the module user to use the interface. And it is not just a matter of "getting used to" using callbacks.

                    - tye
      Now that makes more sense. But, it seems that it's not that you're missing continuations (though those would certainly help) ... it's like you're missing a layer of control over the stream itself. My naive thought was that I could tell the stream "Don't process another chunk ... I'm still working on the one you just gave me, but I need to be able to do other stuff, too." Maybe, naive is the word for it.

      It sounds like there's the need for a event listener. Each stream would issue events and the listener would reap them, as appropriate. Every time an event from a stream is reaped, the stream is informed and it can then send up another event to be reaped.

      The callback part of this would be that you register with the listener that you're interested in XYZ event from streams A, B, and C. The listener would discard any event from the streams that is not being listened for by anybody.

      Does this already exist? Can this exist? (Well, I'm pretty sure it can, cause that's how modern OS's work ...)

      ------
      We are the carpenters and bricklayers of the Information Age.

      Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

      Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

        So now you want asynchronous callbacks? Are you trying for the worst of both worlds (hard for the module writer to write and hard for the module users to use) ?

        So instead of doing a bit extra work and making iterators, you want to do extra work to make something asynchronous? And yet you want these seperate asynchrous streams to be synchronized with each other?

        I suggest you take some more time thinking about this. How does my callback return yet not say that the token has been processed? And then where/when do I get notification to the stream that I've now (asychronously) processed the token? What a nightmare!

        Or you want to replace three streams with one event stream? Which requires you to process events out of order.

        I think you still just don't get it.

                        - tye

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://244026]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others perusing the Monastery: (5)
As of 2024-04-16 07:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found