Your use of the 'name' attribute in XHTML-ish data is misleading: names should be unique (and the name attribute is deprecated in XHTML). If you have any control over the data, using the 'class' attribute would be cleaner.
Also, XML::XPath is NOT a good module to use. It's slow and, more importantly, it is not actively maintained. As mentioned above, XML::LibXML is a much better option, and the code will be very similar.
That said, here is a solution with XML::Twig that should be easy on the RAM (it purges the in-memory structure after each 'w' element). Note that the code is untested, because you did not give us sample data.
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
{
    my $file = $ARGV[0];
    my $wns  = {};    # { <id> => [ <a text>, ... ], ... }
    my $ls   = {};    # same

    XML::Twig->new(
        twig_handlers => {
            q{/e/p//w[@id]/a[@name="wn"]} => sub { add_value( @_, $wns ); },
            q{/e/p//w[@id]/a[@name="l"]}  => sub { add_value( @_, $ls ); },
            # once you're done with a w element you can get rid of it;
            # purge frees the memory without printing anything
            # (flush would print the document to STDOUT)
            q{/e/p//w} => sub { $_[0]->purge; },
        },
    )->parsefile($file);

    for my $n ( 1 .. 32812 ) {
        next unless $ls->{$n} && $wns->{$n};
        print "@{$ls->{$n}}#@{$wns->{$n}}\n";
    }
}
# get the id and then add the text of a in the proper array
sub add_value {
    my ( $t, $elt, $store ) = @_;
    my $id = $elt->parent->id;
    $store->{$id} ||= [];
    push @{ $store->{$id} }, $elt->text;    # or xml_string if you want embedded tags
}
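
For comparison, here is a rough XML::LibXML sketch of the same extraction. It is just as untested against your real data, and it assumes the structure implied by the XPath expressions above ('a' elements directly under 'w' elements that carry an id, somewhere below /e/p). The helper name ls_wn_lines is made up for this example. Unlike the XML::Twig version, it loads the whole document into memory, prints in document order rather than by id, and does not merge 'w' elements that share an id.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# hypothetical helper: returns one "<l texts>#<wn texts>" line per w element
# that has at least one a[@name="l"] and one a[@name="wn"] child
sub ls_wn_lines {
    my ($doc) = @_;
    my @lines;
    for my $w ( $doc->findnodes('/e/p//w[@id]') ) {
        my @ls  = map { $_->textContent } $w->findnodes('a[@name="l"]');
        my @wns = map { $_->textContent } $w->findnodes('a[@name="wn"]');
        next unless @ls && @wns;
        push @lines, join( ' ', @ls ) . '#' . join( ' ', @wns );
    }
    return @lines;
}

if (@ARGV) {
    # builds the full DOM in memory, so RAM use grows with file size
    my $doc = XML::LibXML->load_xml( location => $ARGV[0] );
    print "$_\n" for ls_wn_lines($doc);
}
```

If the file is too big for that, stick with the XML::Twig version, which only keeps one 'w' element in memory at a time.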