
Incremental parsing of multiple XML streams?

by nothingmuch (Priest)
on Jan 07, 2005 at 21:13 UTC ( #420383=perlquestion )

nothingmuch has asked for the wisdom of the Perl Monks concerning the following question:


I need to get an event/callback interface to multiple XML streams that are coming in from the network at the same time.

I would like to say

and have the parser parse as much as it can from that string, even if it's not a complete document or a balanced chunk.

As more data becomes available, I would like to give it to the parsers, and get events from that.

Unfortunately all the parsers I've found encapsulate the read loop. One exception is XML::LibXML::SAX, which has a parse_chunk method. The problem is that the chunk needs to be a balanced string of XML, and since I don't know whether it's balanced before it's parsed (or partially parsed), I am stuck. I would like to parse the data, but I can't parse it till it's parsed. Ugh.

To further complicate things, it's difficult to know when the document is complete, because several documents come in from the stream one after another, and getting an event for the end of a document is the most reliable way to figure out when you're done.

zz zZ Z Z #!perl

Replies are listed 'Best First'.
Re: Incremental parsing of multiple XML streams?
by redhotpenguin (Deacon) on Jan 07, 2005 at 22:20 UTC
    I faced a problem very similar to this a few months back. I ended up creating some ugly code which essentially stitched together partial snippets into well-formed XML blocks, then passed those blocks to the parser.

    I did find a parse_more method in XML::Parser::ExpatNB, which looked like it might do the trick, but never got around to actually seeing if it could parse unbalanced XML snippets.


    I did some hacking on the ExpatNB solution; here's what I came up with. It would have been much better than the stream mender I put together myself.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use XML::Parser::Expat;    # this module also provides XML::Parser::ExpatNB
    use Data::Dumper qw(Dumper);

    my $parser = XML::Parser::ExpatNB->new();
    $parser->setHandlers('Start' => \&sh, 'End' => \&eh, 'Char' => \&ch);

    # Feed deliberately unbalanced fragments, one at a time.
    foreach my $snippet (qw( < bro ke nx ml> con tent < /bro kenxm l> )) {
        print "Waiting for an event...\n";
        $parser->parse_more($snippet);
    }
    $parser->parse_done;    # no more input: finish parsing and release the parser

    # Handlers receive the parser object first, then the element name or text.
    sub sh { print "A start element: ", Dumper($_[1]), "\n"; }
    sub eh { print "An end element: ", Dumper($_[1]), "\n"; }
    sub ch { print "Some Data: ", Dumper($_[1]), "\n"; }

    1;
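For the multi-stream case in the original question, the same parse_more approach might be extended to one parser per stream. What follows is only a minimal sketch: the stream labels `a` and `b`, the `make_parser` helper, and the chunk contents are all made up for illustration, and events are collected in an array rather than acted on.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Parser::Expat;    # also provides XML::Parser::ExpatNB

my @events;    # collected here only so the result is easy to inspect

# Hypothetical helper: build one non-blocking parser per stream and tag
# every event with the stream's label.
sub make_parser {
    my ($name) = @_;
    my $parser = XML::Parser::ExpatNB->new();
    $parser->setHandlers(
        Start => sub { push @events, "$name start $_[1]" },
        End   => sub { push @events, "$name end $_[1]" },
    );
    return $parser;
}

my %parser = map { $_ => make_parser($_) } qw(a b);

# Made-up chunks, split at awkward places and arriving interleaved.
my @arrivals = (
    [ a => '<fo' ],       [ b => '<bar><ba' ], [ a => 'o>text</f' ],
    [ b => 'z/></bar>' ], [ a => 'oo>' ],
);

for my $arrival (@arrivals) {
    my ($name, $chunk) = @$arrival;
    $parser{$name}->parse_more($chunk);    # returns once the chunk is consumed
}

# Signal end of input to each parser so any buffered data is flushed.
$_->parse_done for values %parser;

print "$_\n" for @events;
```

Each parser keeps its own state, so events within one stream arrive in document order. How the two streams' events interleave depends on when the underlying expat library decides it has a complete token, so that ordering should not be relied on.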
      This is exactly what I was searching for... How it eluded me I cannot explain.

      Many thanks!

      zz zZ Z Z #!perl
•Re: Incremental parsing of multiple XML streams?
by merlyn (Sage) on Jan 07, 2005 at 22:26 UTC
    If it's basic vanilla XML without too much weird stuff, you can use HTML::Parser in "XML mode", setting up your proper callbacks as you recognize the right pieces. An HTML::Parser object accepts repeated parse method calls, each taking whatever chunk you have just read, however it happens to be split.

    Might not be perfect, but it just might be what you need.

    I demonstrate this technique in one of my columns.

    Oh, and apparently the XML::LibXML parser has a parse_chunk method which works similarly to HTML::Parser's parse method. Try that too.
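A minimal sketch of the HTML::Parser suggestion, assuming the version 3 API with `xml_mode` and `unbroken_text` enabled. The element names and chunk boundaries are invented for illustration, and events are pushed onto an array instead of driving real callbacks.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Parser ();

my @events;

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { push @events, "start $_[0]" }, 'tagname' ],
    end_h   => [ sub { push @events, "end $_[0]" },   'tagname' ],
    text_h  => [ sub { push @events, "text $_[0]" },  'dtext' ],
);
$p->xml_mode(1);         # XML-style tag handling instead of HTML rules
$p->unbroken_text(1);    # deliver each text run whole, not chunk by chunk

# Made-up chunks that cut tags and text in the middle.
for my $chunk ('<doc><it', 'em id="1">hel', 'lo</item></do', 'c>') {
    $p->parse($chunk);    # parses what it can and buffers the rest
}
$p->eof;                  # signal the end of the input

print "$_\n" for @events;
```

Note that HTML::Parser does not check well-formedness, so this is a tolerant tokenizer rather than a validating XML parser.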

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      That's a wonderful column, merlyn. IIRC it was the first one by you that I've read (possibly the first about perl I've read ever).

      Although it seems to provide me with a good solution, I think I will go with redhotpenguin's, as it will probably be more robust than HTML::Parser, and I already know for certain that it does exactly what I need.

      Update: Apparently my memory is playing tricks on me; that was not so long ago. Nevertheless, I remember both it and this one as mind-opening, in terms of using one tool to get another's job done.

      zz zZ Z Z #!perl
Re: Incremental parsing of multiple XML streams?
by paulbort (Hermit) on Jan 07, 2005 at 22:05 UTC

    One of the major selling points of XML is that it is well structured: an XML parser can determine if the data is well-formed.

    It sounds like the right way to go would be to use a buffer that captures the XML as it comes in, then when it gets the closing tag that matches the first opening tag, cut the buffer off and send it to be parsed.
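A rough sketch of that buffering idea, under loudly stated assumptions: `make_splitter` is a hypothetical helper, and it is deliberately naive. It assumes the root element's name never appears nested inside itself, that the root is not an empty `<root/>` element, and that no `>` occurs inside attribute values, comments, or CDATA before the first tag ends.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that accumulates chunks and hands
# back every complete document seen so far (text before the root element,
# such as an XML declaration, is dropped).
sub make_splitter {
    my $buffer = '';
    return sub {
        my ($chunk) = @_;
        $buffer .= $chunk;
        my @docs;
        # Find the first complete opening tag; its name is taken as the root.
        while ($buffer =~ /<([A-Za-z_][\w.:-]*)[^>]*>/) {
            my ($start, $after_open) = ($-[0], $+[0]);
            my $close = "</$1>";
            my $end   = index($buffer, $close, $after_open);
            last if $end < 0;    # closing tag has not arrived yet
            my $stop = $end + length $close;
            push @docs, substr($buffer, $start, $stop - $start);
            substr($buffer, 0, $stop) = '';    # keep only the unparsed tail
        }
        return @docs;
    };
}

my $feed = make_splitter();
my @complete;
push @complete, $feed->($_) for '<a><b>1</b', '></a><c>2', '</c>';
print "$_\n" for @complete;
```

Each returned string could then be passed to any ordinary parser. This only yields whole documents, so no events are produced until a document's closing tag has arrived.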

    Perhaps the reason you're having a hard time is that you're fighting all of the things that have been done to make XML in Perl easy and reliable.

    Spring: Forces, Coiled Again!
      If that's the case, I can just wait for </xml>, and then parse that string up.

      But I need a SAX-like interface, generating events for data as it enters. I'm not concerned about the XML's tree at the end, because I need its data parsed (as far as possible) as it arrives. The opening of an element concerns me before its end is even on the data stream, let alone actually parsed.

      zz zZ Z Z #!perl
Re: Incremental parsing of XML?
by mirod (Canon) on Jan 07, 2005 at 21:34 UTC

    Can't you just pipe the data to the parser as it comes, so the parser thinks it is just a regular file? Have a process that gets the data, then pipes it to the process that parses it.

      parser a    parser b
      stream a    stream b
      I would like to give parser a data from stream a, and parser b data from stream b, simultaneously, in the same process (no threads).

      You can just use a select loop, and read nonblocking on your own, to get data from multiple streams at the same time.

      What I need to get this done is parsers that can be told 'here is more data', and that return even if that's not a complete XML document. They parse as much as they can given the current buffer, which they keep. When more data becomes available, they continue parsing.

      Basically, instead of reading more from a handle, I'd like the parsers to return control to me, so I can give them more strings.
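The select loop mentioned above might look roughly like the sketch below. It uses only core modules; the two pipes are stand-ins for network sockets, and the `%on_chunk` callbacks are hypothetical placeholders for wherever each chunk would be handed to a parser.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use IO::Handle;
use IO::Select;

my %received = (a => '', b => '');    # what each stream has delivered so far
my %on_chunk;                         # fileno => callback for that stream
my $select = IO::Select->new;

# Stand-ins for two network sockets: pipes that already hold some data.
for my $name (qw(a b)) {
    pipe(my $reader, my $writer) or die "pipe: $!";
    print {$writer} "<$name>data for stream $name</$name>";
    close $writer or die "close: $!";
    $reader->blocking(0);
    $select->add($reader);
    # Hypothetical callback; a real one might call $parser->parse_more($_[0]).
    $on_chunk{ fileno($reader) } = sub { $received{$name} .= $_[0] };
}

while ($select->count) {
    for my $fh ($select->can_read) {
        my $n = sysread($fh, my $chunk, 8);    # tiny reads mimic partial data
        if (!defined $n) {
            next if $!{EAGAIN} || $!{EWOULDBLOCK};    # nothing to read yet
            die "sysread: $!";
        }
        if ($n == 0) {                         # end of this stream
            $select->remove($fh);
            close $fh;
            next;
        }
        $on_chunk{ fileno($fh) }->($chunk);    # hand the chunk over and move on
    }
}

print "$_: $received{$_}\n" for sort keys %received;
```

The loop never waits on a single stream: whichever handle has data gets one short read, its callback runs, and control comes straight back to the loop.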

      zz zZ Z Z #!perl

Node Type: perlquestion [id://420383]
Approved by kutsu
Front-paged by kutsu