PerlMonks

Processing Two XML Files in Parallel

by tedv (Pilgrim)
on Jul 21, 2011 at 21:10 UTC ( #915997=perlquestion )
tedv has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a script that needs to process two XML files in parallel. It needs to take element #1 from file A and element #1 from file B, and output a new element into file C. Then it performs the same operation on element #2, and so on. As a very simple example:


Input file A:
<doc> <elem>A</elem> <elem>B</elem> <elem>C</elem> </doc>
Input file B:
<doc> <elem>1</elem> <elem>5</elem> <elem>10</elem> </doc>
Output file C:
<doc> <elem>A</elem> <elem>BBBBB</elem> <elem>CCCCCCCCCC</elem> </doc>

The catch is that the files are very large, so you cannot parse them into memory all at once. And sadly the XML::Parser interface seems to require parsing the entire first file, handling all of its callbacks, before you can begin parsing the second file.
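As an illustration of the "pull" style being asked for (this sketch is not from the thread, and it uses Python rather than Perl purely because its standard library has a convenient incremental parser): two element streams can be consumed in lockstep, with memory freed as each element is finished.

```python
import xml.etree.ElementTree as ET
from io import StringIO

def elems(source, tag="elem"):
    """Yield the text of each <tag> element, discarding it afterwards."""
    for event, node in ET.iterparse(source, events=("end",)):
        if node.tag == tag:
            yield node.text
            node.clear()  # free the subtree so memory stays bounded

# Stand-ins for the two large input files from the question.
file_a = StringIO("<doc><elem>A</elem><elem>B</elem><elem>C</elem></doc>")
file_b = StringIO("<doc><elem>1</elem><elem>5</elem><elem>10</elem></doc>")

out = ["<doc>"]
for text, count in zip(elems(file_a), elems(file_b)):
    out.append("<elem>%s</elem>" % (text * int(count)))
out.append("</doc>")
print("".join(out))  # <doc><elem>A</elem><elem>BBBBB</elem><elem>CCCCCCCCCC</elem></doc>
```

A production version would also detect one stream ending before the other, which `zip` silently truncates.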

Now if this were just a simple text file, the code would be pretty simple. It would look something like this:

    # Open both input files
    open A, "<$file_a" or die "Unable to open $file_a: $!\n";
    open B, "<$file_b" or die "Unable to open $file_b: $!\n";

    # Process the files in parallel
    while (1) {
        # Read the lines
        my $a = <A>;
        my $b = <B>;

        # Good coders would check and warn if one entry was defined and the other
        # was not, but this is just an example, so you should be happy you even
        # get comments.
        last if !defined $a || !defined $b;

        # Process the output
        print data_transform($a, $b);
    }
    close A;
    close B;

But because it's XML, everything is more painful. Does anyone know what might work? Someone suggested XML::Twig, but I'm still reading the documentation to make sure its internal implementation doesn't prevent this from working.


-Ted

Comment on Processing Two XML Files in Parallel
Select or Download Code
Re: Processing Two XML Files in Parallel
by Logicus on Jul 21, 2011 at 21:29 UTC

    Does each element have a line to itself or is the data multiline? As in, if we read, say, line 123 from file A, will line 123 in file B be the correct line to do the processing with?

    If that is the case, then you could just read both files a line at a time, and use a simple regex to get the value out of the <elem> wrapper;

    my ($a, $b, $value_a, $value_b);
    while (1) {
        $a = <A>;
        $b = <B>;
        last if !defined $a || !defined $b;
        $value_a = $1 if $a =~ m{<elem>(.*?)</elem>};
        $value_b = $1 if $b =~ m{<elem>(.*?)</elem>};
        print data_transform($value_a, $value_b);
    }

    I'm sure better perl adepts than me could write it better/faster, but I think that would work if the files have a line for line concurrency.

      So you like catch phrases, huh?
      Let me tell you something:
      About 97% of the time, parsing XML with regexes is the root of all evil. The remaining 3% is left for one-time, quick & dirty scripts and maybe some special cases (where you can be sure the XML will stay exactly like that).
      Let me tell you why:
      The creator of the XML you parse might change it. All elements might end up on one line. Maybe there will be some empty lines between the tags. Maybe the elem tags will get attributes in the future. In every case your script will suddenly stop working, although the actual content you want didn't change. And somebody has to fix it quickly. In the end it's more work than just doing it right from the beginning, and you have potentially annoyed a customer and your boss.
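      A tiny demonstration of the failure modes listed above (Python is used here just to run the regex; the pattern is the same one from the reply):

```python
import re

# The same pattern the line-based solution relies on.
pat = re.compile(r"<elem>(.*?)</elem>")

# Works on the format exactly as it looks today.
assert pat.search("<elem>B</elem>").group(1) == "B"

# Semantically identical XML after harmless producer-side changes:
assert pat.search('<elem id="2">B</elem>') is None  # an attribute appears
assert pat.search("<elem>\nB\n</elem>") is None     # element spans lines ('.' stops at \n)
print("the regex matched only the original layout")
```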

      That's how experienced programmers think. Because they know that things like that happen.
      You not only posted a quick & dirty solution, you even bashed someone for posting a clean and correct solution. A quick & dirty solution is OK (although it would be nice to note that it depends on the exact XML format), and you actually got some ++ for it, but bashing someone else's correct solution is just infantile.

        Hey man, I asked beforehand if the lines had 1-to-1 concurrency, and I didn't bash anyone ~ I asked WHY. So now you've told me WHY, I'm happy... THANK YOU
Re: Processing Two XML Files in Parallel
by ikegami (Pope) on Jul 21, 2011 at 22:19 UTC
    use strict;
    use warnings;
    use XML::LibXML qw( );

    die "Usage" if @ARGV != 3;

    my $parser = XML::LibXML->new();

    my @counts;
    {
        my $doc  = $parser->parse_file($ARGV[1]);
        my $root = $doc->documentElement();
        @counts = map $_->textContent, $root->findnodes('elem');
    }

    {
        my $doc  = $parser->parse_file($ARGV[0]);
        my $root = $doc->documentElement();
        for my $node ($root->findnodes('elem')) {
            die "Not enough counts" if !@counts;
            $node->appendText( $node->textContent() x (shift(@counts) - 1) );
        }
        print $doc->toFile($ARGV[2]);
    }

    die "Too many counts" if @counts;

    Or if counts of 0 are acceptable:

    my $new_text = $node->textContent() x shift(@counts);
    $node->removeChild($_) for $node->findnodes('text()');
    $node->appendText($new_text);

    Tested.

      You see, this is what I don't like about the way Perl is being used these days... what a huge, verbose and difficult-to-understand mass of code that is!

      Why oh why do you need all that complexity to do a simple task? WHY???? I refuse to learn to write code that has to be that verbose without some sort of damn good reason!

      Perl attracted me with its elegance and simplicity, but when I dig into it, the way it's apparently "supposed to be done" is neither elegant nor simple! My aversion to code like the above is exactly the reason that I'm NOT a Java coder.

      What ever happened to keep it simple?

        You know, two thousand years ago there was a language called 'koine'. This language was based on Greek, and its purpose was easy communication between the various peoples of the ancient world.
        Ancient Greeks were also disappointed by koine, but who cares? The code above is easy to read, at least to my eyes.
        Why do you complain about the length of my solution when yours is twice as long? (Well, it would be if you hadn't only done half the work.)
        I refuse to learn?

        And that sums up everything we have learnt of you over the last couple of weeks.

        I have a suggestion for you. Change your career (or hobby), because as a programmer you are no good at all, and never will be. You do not have the aptitude for it.

        That doesn't mean you aren't intelligent, just that you do not have the right mindset for programming. Or all of them.

        But give up programming because you will never be even vaguely useful at it, never mind good.

        Try writing or poetry or music or chess or carpentry.

Re: Processing Two XML Files in Parallel
by Tanktalus (Canon) on Jul 21, 2011 at 23:42 UTC

    The easiest way is to just bring everything into memory and deal with it. In CB, you said that you don't have a TB of RAM, so I'm assuming these files are GB+ in size. At which point, I'm wondering WTF they're doing in XML :-)

    I also don't quite follow how you want to do the comparison. Is it just the text of certain nodes? The text of all nodes? XML::Twig allows you to flush the in-memory representation, freeing up all the memory used thus far, but whether you can do that really depends on how you're thinking of doing the comparison. With line-record-based text, it's fairly obvious. With XML, the definition of "record" is much less clear in general - only you know the specifics.

    As I said in CB, I'd consider turning XML::Twig on its head with Coro. It looks like you should be able to turn XML::Parser on its head, too. But, either way, you'll likely have to turn them on their heads. Warning, the following code is COMPLETELY untested. Channels may be required instead of rouse_wait'ing all the time.

    sub twig_iterator {
        my $file = shift;
        my $cb = Coro::rouse_cb;
        my $twig = XML::Twig->new(
            twig_handlers => {
                elem      => sub { $cb->(elem      => @_) },
                otherelem => sub { $cb->(otherelem => @_) },
            },
        );
        my $done;
        # $cb->() rouses with no parameters.
        async { shift->parse(); $cb->() } $twig;
        sub {
            Coro::rouse_wait($cb); # will return the parameters received by $cb above
        }
    }

    my $itA = twig_iterator($fileA);
    my $itB = twig_iterator($fileB);

    while (1) {
        # if array has no items, it's done parsing, otherwise:
        # [0] == elem name (hardcoded in above)
        # [1..$#array] == items passed in by XML::Twig to the callback
        my @A = $itA->();
        my @B = $itB->();
        # compare?
    }
    I'm not sure if this properly deals with end-of-files, but I think so. Like I said, UNTESTED. Be sure to have proper twig flushing (I think the [1] items will be the twig reference) so that you don't use all your RAM (if this isn't a problem, then don't use this at all - just suck the whole files in!).

        use strict;
        use warnings;
        use XML::LibXML::Reader qw( :types );

        sub new {
            my $class = shift;
            return bless({
                reader     => XML::LibXML::Reader->new(@_),
                elem_depth => 0,
                buf        => '',
            }, $class);
        }

        sub get_next {
            my ($self) = @_;
            my $reader = $self->{reader};
            for (;;) {
                return () if $reader->read() != 1;
                if ($reader->nodeType() == XML_READER_TYPE_TEXT) {
                    if ($self->{elem_depth} && $reader->depth() == $self->{elem_depth} + 1) {
                        $self->{buf} .= $reader->value();
                    }
                }
                elsif ($reader->nodeType() == XML_READER_TYPE_ELEMENT) {
                    if ($reader->name() eq 'elem') {
                        $self->{elem_depth} = $reader->depth();
                    }
                }
                elsif ($reader->nodeType() == XML_READER_TYPE_END_ELEMENT) {
                    if ($reader->name() eq 'elem') {
                        return substr($self->{buf}, 0, length($self->{buf}), '');
                    }
                }
            }
        }

        {
            my $reader1 = __PACKAGE__->new(location => "file1.xml");
            my $reader2 = __PACKAGE__->new(location => "file2.xml");
            for (;;) {
                my $text1 = $reader1->get_next();
                my $text2 = $reader2->get_next();
                last if !defined($text1) && !defined($text2);
                die if !defined($text1);
                die if !defined($text2);
                process_data($text1, $text2);
            }
        }

        Assumes all elem elements are "interesting" ones, not just the ones found under the root. Easy to change, though.

        Output left to the user. May I suggest XML::Writer since it keeps next to nothing in memory.
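        The streaming-output idea can be sketched like this (illustration only, using Python's xml.sax.saxutils rather than XML::Writer, but the shape is the same: emit each element as soon as its pair is processed, keep nothing in memory):

```python
import io
from xml.sax.saxutils import XMLGenerator

buf = io.StringIO()  # stands in for the real output file C
gen = XMLGenerator(buf)
gen.startDocument()
gen.startElement("doc", {})
for text in ["A", "BBBBB", "CCCCCCCCCC"]:  # would come from the paired readers
    gen.startElement("elem", {})
    gen.characters(text)
    gen.endElement("elem")
gen.endElement("doc")
gen.endDocument()
print(buf.getvalue())
```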

      I concede that the difficulty of processing these two files suggests to me that something has gone very wrong with the input specifications. And since XML is part of that, I naturally assume it's the fault of XML. I might try to get them to change the specification such that each line is a well-formed XML fragment containing no embedded newlines. Then just do a standard double-line reader.


      -Ted
Re: Processing Two XML Files in Parallel
by Jenda (Abbot) on Jul 21, 2011 at 23:55 UTC

    Depends on the exact format of your XML files, but maybe XML::Records could help. It allows you to ask for the next "record" from the XML, so you can ask for the first from one file, the first from the second, then again from the first and again from the second, and so forth.

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

Re: Processing Two XML Files in Parallel
by mirod (Canon) on Jul 22, 2011 at 07:08 UTC

    One way to do this is to use XML::Twig and Coro: have one thread parse the first input file and another parse the second. Pass control between the two threads after each elem has been parsed:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use Coro;
    use XML::Twig;

    use Test::More;
    use Perl6::Slurp;
    use autodie qw(open);

    my $INPUT_A  = "input_A.xml";  # input file A
    my $INPUT_B  = "input_B.xml";  # input file B
    my $OUTPUT   = "output.xml";
    my $EXPECTED = "expected.xml"; # output file C

    open( my $out, '>', $OUTPUT);

    my $times; # global, maybe Coro has a better way to pass it around but I don't know it

    my $t1= XML::Twig->new( twig_handlers => { elem => \&main_elem }, keep_spaces => 1);
    my $t2= XML::Twig->new( twig_handlers => { elem => \&get_times });

    # to get the numbers first, before the letters, t2 will be parsed in the main loop
    async { $t1->parsefile( $INPUT_A); };
    $t2->parsefile( $INPUT_B);

    print {$out} "\n"; # missing \n for some reason
    $t1->flush( $out);
    print {$out} "\n"; # missing \n for some reason

    close $out;

    is( slurp( $OUTPUT), slurp( $EXPECTED), 'the one test');
    done_testing();

    sub main_elem {
        my( $t, $elem)= @_;
        $elem->set_text( $elem->text x $times);
        $t->flush( $out);
        cede;
    }

    sub get_times {
        my( $t, $elem)= @_;
        $times= $elem->text;
        $t->purge;
        cede;
    }

    You will need to check that memory is indeed freed after each record. It should be OK, but I don't know exactly how Coro deals with memory, I had never used it before today.

    Thank you for asking this and making me look into the problem. And to whoever mentioned Coro yesterday in the CB. This is something I had wanted to do for a long time, but I had always deferred it since I did not really need it for work. Overall it was pretty painless though, the Coro intro is quite well written.

    update: also, I should have read Tanktalus's answer, above, since he obviously knows Coro a lot better than I do. I am still happy I answered though, at least I learned something.

Re: Processing Two XML Files in Parallel
by Anonymous Monk on Jul 22, 2011 at 11:33 UTC
    I keep wondering if you could use an SQLite database (file...) here. Kind of like a tied-hash only better. I do not know how many elements in these massive files actually change from one run to the next; nor how many are in common. But maybe you could capture data from first one then the other into an SQLite table, which, since it is just a file, requires no server setup. Determine what differences actually exist, then use these to update or to rebuild file C. The overall strategy of using two massive XML files needs to be reviewed carefully, either by you or by your managers or both.
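    A minimal sketch of that staging idea (all table and file names here are invented for the example, and Python's sqlite3 stands in for whatever DBI glue a Perl version would use): stream each file's elements into a table keyed by position, then join the two tables.

```python
import sqlite3
import xml.etree.ElementTree as ET
from io import StringIO

db = sqlite3.connect(":memory:")  # a real run would name a file, e.g. "stage.db"
db.execute("CREATE TABLE a (pos INTEGER PRIMARY KEY, txt TEXT)")
db.execute("CREATE TABLE b (pos INTEGER PRIMARY KEY, txt TEXT)")

def load(table, source):
    """Stream <elem> texts into a table, one row per element position."""
    pos = 0
    for event, node in ET.iterparse(source, events=("end",)):
        if node.tag == "elem":
            # table name is one of our two fixed identifiers, not user input
            db.execute("INSERT INTO %s VALUES (?, ?)" % table, (pos, node.text))
            pos += 1
            node.clear()

load("a", StringIO("<doc><elem>A</elem><elem>B</elem><elem>C</elem></doc>"))
load("b", StringIO("<doc><elem>1</elem><elem>5</elem><elem>10</elem></doc>"))

rows = db.execute(
    "SELECT a.txt, b.txt FROM a JOIN b ON a.pos = b.pos ORDER BY a.pos"
).fetchall()
merged = [text * int(count) for text, count in rows]
print(merged)  # ['A', 'BBBBB', 'CCCCCCCCCC']
```

A LEFT JOIN in each direction would also reveal which positions exist in only one file, which is the "determine what differences actually exist" step.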
Re: Processing Two XML Files in Parallel
by ambrus (Abbot) on Jul 24, 2011 at 20:39 UTC

    I agree with the previous replies that running the two XML parsers each in its own Coro thread seems to be a good way to do this. However, I'd like to show a solution not using Coro, just for the challenge of it.

    This solution uses the stream-parsing capability of XML::Parser. The documentation of XML::Twig states that you probably should not use it with XML::Twig, and that it is untested.

    We read the input XML files in small chunks (20 bytes here for demonstration, but it should be much more than that in a real application). In each loop iteration, we read from the file that's behind the other, that is, the one from which we have read fewer items so far. This way the files remain in sync even if the lengths of the items differ. Once the XML parser has found an item from both files, we pair them and print an item with the two texts concatenated.

    The warnings I have commented out show that the files are indeed read in parallel. I also hope that chunks of the file we have processed don't remain in memory, and there are no other bugs, but then you should of course verify this if you want to use this code in production.
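    The same feed-the-parser-that-is-behind loop can be sketched outside Perl with Python's XMLPullParser (illustration only; the tag name, chunk size and pairing mirror the description above, not ambrus's exact code):

```python
import xml.etree.ElementTree as ET

DOC_A = "<doc><elem>A</elem><elem>B</elem><elem>C</elem></doc>"
DOC_B = "<doc><elem>1</elem><elem>5</elem><elem>10</elem></doc>"

CHUNK = 20  # tiny on purpose, as in the post; much larger in real use
streams = [DOC_A, DOC_B]
parsers = [ET.XMLPullParser(events=("end",)) for _ in streams]
queues = [[], []]          # items parsed so far from each file
offsets = [0, 0]
done = [False, False]
pairs = []

while not all(done):
    # feed the stream that has produced fewer items so far (the one "behind")
    n = min((j for j in (0, 1) if not done[j]), key=lambda j: len(queues[j]))
    chunk = streams[n][offsets[n]:offsets[n] + CHUNK]
    offsets[n] += CHUNK
    if chunk:
        parsers[n].feed(chunk)
    else:
        parsers[n].close()  # flushes any remaining events
        done[n] = True
    for event, node in parsers[n].read_events():
        if node.tag == "elem":
            queues[n].append(node.text)
    # pair off items as soon as both sides have one
    while queues[0] and queues[1]:
        text, count = queues[0].pop(0), queues[1].pop(0)
        pairs.append(text * int(count))

print(pairs)  # ['A', 'BBBBB', 'CCCCCCCCCC']
```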

    use warnings;
    use strict;
    use Encode;
    use XML::Twig;

    binmode STDERR, ":encoding(iso-8859-2)";
    our(@XMLH, @xmln, @tw, @pa, @eof, @it, $two, $roo);
    for my $n (0 .. 1) {
        $xmln[$n] = shift || ("a1.xml", "a2.xml")[$n];
        open $XMLH[$n], "<", $xmln[$n] or die "error open xml${n}: $!";
        $tw[$n] = XML::Twig->new;
        $tw[$n]->setTwigHandler("item", sub {
            my($twt, $e) = @_;
            my $t = $e->text;
            #warn " "x(24+8*$n), "${n}g|$t|\n";
            push @{$it[$n]}, $t;
            $twt->purge;
        });
        $pa[$n] = $tw[$n]->parse_start;
        $it[$n] = [];
    }
    $two = XML::Twig->new(output_filter => "safe", pretty_print => "nice");
    $roo = XML::Twig::Elt->new("doc");
    $two->set_root($roo);
    while (1) {
        my $n = undef;
        my $itq = 1e9999;
        for my $j (0 .. 1) {
            if (!$eof[$j] && @{$it[$j]} <= $itq) {
                $n = $j;
                $itq = @{$it[$j]};
            }
        }
        if (!defined($n)) { last; }
        if (read $XMLH[$n], my $b, 20) {
            #my $bp = decode("iso-8859-2", $b); $bp =~ y/\r\n/./;
            #warn " "x(8+8*$n), "${n}r|$bp|\n";
            $pa[$n]->parse_more($b);
        } else {
            eof($XMLH[$n]) or die "error reading xml${n}";
            $pa[$n]->parse_done;
            $eof[$n]++;
        }
        my $eo;
        while (@{$it[0]} && @{$it[1]}) {
            my $i0 = shift @{$it[0]};
            my $i1 = shift @{$it[1]};
            $eo = XML::Twig::Elt->new("item", "$i0 $i1");
            $eo->paste_last_child($roo);
            #warn "p|$i0 $i1|\n";
        }
        if (defined($eo)) { $two->flush_up_to($eo); }
    }
    for my $n (0 .. 1) {
        if (my $c = @{$it[$n]}) {
            warn "warning: xml${n} has $c additional items";
        }
    }
    $two->flush;
    #warn "all done";
    __END__

    Update 2013-04-23: RFC: Simulating Ruby's "yield" and "blocks" in Perl may be related.

      Fiddling with Coro or reading in blocks when all you need is a pull style parser seems a bit silly. Even though it is a nice exercise.

      Jenda
      Enoch was right!
      Enjoy the last years of Rome.

Re: Processing Two XML Files in Parallel
by GrandFather (Cardinal) on Jul 28, 2011 at 08:42 UTC

    You might like to try XML::TreePuller. If your sample is a fair representation of your problem, then something like the following ought to do the trick for you:

    use strict;
    use warnings;
    use XML::TreePuller;
    use XML::Writer;

    my $fileA = <<XML;
<doc>
<elem>A</elem>
<elem>B</elem>
<elem>C</elem>
</doc>
XML

    my $fileB = <<XML;
<doc>
<elem>1</elem>
<elem>5</elem>
<elem>10</elem>
</doc>
XML

    # Open both input files
    open my $inA, "<", \$fileA;
    open my $inB, "<", \$fileB;

    my $readerA = XML::TreePuller->new (IO => $inA);
    my $readerB = XML::TreePuller->new (IO => $inB);
    my $writer  = XML::Writer->new (DATA_MODE => 1);

    # Process the files in parallel
    $readerA->iterate_at('/doc/elem' => 'short');
    $readerB->iterate_at('/doc/elem' => 'short');

    $writer->startTag ('doc');
    while ((my $elmtA = $readerA->next ()) && (my $elmtB = $readerB->next ())) {
        my $nameA = $elmtA->name ();
        my $nameB = $elmtB->name ();

        next if $nameA ne 'elem';
        die "Element mismatch: $nameA ne $nameB\n" if $nameA ne $nameB;

        $writer->dataElement ($nameA, $elmtA->text () x $elmtB->text ());
    }
    $writer->endTag();

    close $inA;
    close $inB;

    Prints:

    <doc>
    <elem>A</elem>
    <elem>BBBBB</elem>
    <elem>CCCCCCCCCC</elem>
    </doc>
    True laziness is hard work

Node Type: perlquestion [id://915997]
Approved by ww
Front-paged by Tanktalus