
The easiest way is to just bring everything into memory and deal with it. In CB, you said that you don't have a TB of RAM, so I'm assuming these files are GB+ in size. At which point, I'm wondering WTF they're doing in XML :-)

I also don't quite follow how you want to do the comparison. Is it just the text of certain nodes? The text of all nodes? XML::Twig allows you to flush the in-memory representation, freeing up all the memory used thus far, but whether you can do that really depends on how you're thinking of doing the comparison. With line-record-based text, it's fairly obvious. With XML, the definition of "record" is much less clear in general - only you know the specifics.

As I said in CB, I'd consider turning XML::Twig on its head with Coro, so that the callback-driven parser can be pulled from like an iterator. It looks like you should be able to do the same with XML::Parser. Warning: the following code is COMPLETELY untested. Channels may be required instead of rouse_wait'ing all the time.

use Coro;
use XML::Twig;

sub twig_iterator
{
  my $file = shift;
  my $cb   = Coro::rouse_cb;
  my $twig = XML::Twig->new(
    twig_handlers => {
      elem      => sub { $cb->(elem => @_) },
      otherelem => sub { $cb->(otherelem => @_) },
    },
  );

  # Parse in a separate coroutine; the final $cb->() rouses with
  # no parameters to signal end of file.
  async { shift->parsefile($file); $cb->() } $twig;

  sub {
    Coro::rouse_wait($cb); # will return the parameters received by $cb above
  }
}

my $itA = twig_iterator($fileA);
my $itB = twig_iterator($fileB);

while (1)
{
  # if array has no items, it's done parsing, otherwise:
  # [0] == elem name (hardcoded in above)
  # [1..$#array] == items passed in by XML::Twig to the callback
  my @A = $itA->();
  my @B = $itB->();
  last unless @A or @B; # both files exhausted

  # compare?
}
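
The end-of-input handling in that loop can be tried out without Coro or XML::Twig at all, using plain closures over arrays as stand-ins for the two iterators. This is only a sketch: list_iterator and compare_streams are made-up names, and treating two records as equal when their joined strings match is an assumption, since only you know what "equal" means for your data.

```perl
use strict;
use warnings;

# Hypothetical stand-in for twig_iterator: yields one record (a list)
# per call, then an empty list once the input is exhausted.
sub list_iterator {
    my @records = @_;
    return sub {
        my $next = shift @records;
        return $next ? @$next : ();
    };
}

# Pull from both iterators in lockstep and collect the differences.
sub compare_streams {
    my ($itA, $itB) = @_;
    my @diffs;
    for (my $n = 0; ; $n++) {
        my @A = $itA->();
        my @B = $itB->();
        last unless @A or @B;          # both exhausted
        if (!@A or !@B) {              # one input ended early
            push @diffs, "record $n: only in " . (@A ? 'A' : 'B');
        }
        elsif ("@A" ne "@B") {         # assumed notion of equality
            push @diffs, "record $n: '@A' vs '@B'";
        }
    }
    return @diffs;
}

my $itA = list_iterator([elem => 'one'], [elem => 'two'], [elem => 'three']);
my $itB = list_iterator([elem => 'one'], [elem => '2']);
print "$_\n" for compare_streams($itA, $itB);
```

With the sample data this reports a mismatch at record 1 and a record that exists only in A at record 2. Note that an exhausted iterator has to keep returning an empty list when the other file is longer.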

I'm not sure if this properly deals with end-of-files, but I think so. Like I said, UNTESTED. Be sure to have proper twig flushing (I think the [1] items will be the twig reference) so that you don't use all your RAM (if this isn't a problem, then don't use this at all - just suck the whole files in!).
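
For the channel variant mentioned earlier, a Coro::Channel of size one is one way to make the parser coroutine wait until the consumer has taken the previous record. Treat this as a sketch on the same footing as the code above. The helper name twig_channel_iterator, the single $tag parameter, and the decision to pass only the element text (so the twig can be purged right away) are all assumptions, and a parse error inside the coroutine is not handled.

```perl
use strict;
use warnings;
use Coro;
use Coro::Channel;
use XML::Twig;

# Hypothetical helper: returns an iterator over the text of every
# <$tag> element in $file.
sub twig_channel_iterator {
    my ($file, $tag) = @_;
    my $ch = Coro::Channel->new(1);   # holds at most one pending record

    async {
        my $twig = XML::Twig->new(
            twig_handlers => {
                $tag => sub {
                    my ($t, $elt) = @_;
                    $ch->put([$tag, $elt->text]);  # waits while the channel is full
                    $t->purge;                     # free what has been parsed so far
                },
            },
        );
        $twig->parsefile($file);
        $ch->put(undef);              # end-of-file marker
    };

    my $done;
    return sub {
        return if $done;              # keep returning empty after EOF
        my $item = $ch->get;          # waits until the parser puts something
        return @$item if defined $item;
        $done = 1;
        return;
    };
}
```

The iterator returns (tag, text) per element and an empty list at end of file, so it should drop into the same while loop. Because the channel holds only one record, each parser stays at most one element ahead of the comparison.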

In reply to Re: Processing Two XML Files in Parallel by Tanktalus
in thread Processing Two XML Files in Parallel by tedv
