large XML file in using XPATH

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: large XML file in using XPATH by ikegami (Patriarch) on Nov 05, 2009 at 16:49 UTC
You have 60,000 queries, and they even search along the descendant axis (which tends to include lots of nodes). Why don't you extract all the "w" nodes first, then search through those?

    use strict;
    use warnings;
    use XML::LibXML;

    my $file   = $ARGV[0];
    my $parser = XML::LibXML->new();
    my $doc    = $parser->parse_file($file);
    my $root   = $doc->documentElement();

    for my $w_node ( $root->findnodes('/e/p//w') ) {
        my ($wn_anchor) = $w_node->findnodes('a[@name="wn"]') or next;
        my ($l_anchor)  = $w_node->findnodes('a[@name="l"]' ) or next;
        ...
    }

Your original sorted the results by id. I didn't replicate that behaviour. Your original ignored ids over 32812. I didn't replicate that behaviour. Were you expecting more than one anchor with each name? I ignore all but the first. Let me know if you want any of the above changed.
Re: large XML file in using XPATH by oha (Friar) on Nov 05, 2009 at 16:11 UTC
It's hard to reply to this question: I have no idea whether the issue is the algorithm or memory swapping. I'll go a bit off-topic and suggest changing your approach completely: use a SAX parser and collect the data while reading the XML only once. This would probably speed up the process and reduce memory usage, and it is a good approach whenever you are handling huge XML files. HTH
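A minimal sketch of that single-pass approach, assuming XML::SAX is installed and that the data looks roughly like `<e><p>...<w id="N"><a name="wn">...</a><a name="l">...</a></w>...</p></e>`. That layout is only inferred from the XPath expressions used elsewhere in this thread, since no sample data was posted. The `WCollector` package, the `collect_rows` helper and the sample document are hypothetical names made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical SAX handler: remembers the current <w id="..."> and collects
# the text of any <a name="wn"> or <a name="l"> found inside it.
package WCollector;
use parent 'XML::SAX::Base';

# Look an attribute up by local name, whatever key format the parser uses.
sub _attr {
    my ($el, $local) = @_;
    for my $attr (values %{ $el->{Attributes} || {} }) {
        return $attr->{Value} if $attr->{LocalName} eq $local;
    }
    return;
}

sub start_document {
    my ($self) = @_;
    $self->{rows}  = {};       # id => { wn => [text, ...], l => [text, ...] }
    $self->{w_id}  = undef;    # id of the <w> we are currently inside
    $self->{field} = undef;    # 'wn' or 'l' while inside a matching <a>
}

sub start_element {
    my ($self, $el) = @_;
    if ($el->{LocalName} eq 'w') {
        $self->{w_id} = _attr($el, 'id');
    }
    elsif ($el->{LocalName} eq 'a' && defined $self->{w_id}) {
        my $name = _attr($el, 'name');
        if (defined $name && ($name eq 'wn' || $name eq 'l')) {
            $self->{field} = $name;
            push @{ $self->{rows}{ $self->{w_id} }{$name} }, '';
        }
    }
}

sub characters {
    my ($self, $chars) = @_;
    return unless defined $self->{field};
    # Text may arrive in several chunks, so append to the last entry.
    $self->{rows}{ $self->{w_id} }{ $self->{field} }[-1] .= $chars->{Data};
}

sub end_element {
    my ($self, $el) = @_;
    if    ($el->{LocalName} eq 'a') { $self->{field} = undef }
    elsif ($el->{LocalName} eq 'w') { $self->{w_id}  = undef }
}

package main;
use XML::SAX::ParserFactory;

# Parse an XML string once and return the collected rows.
sub collect_rows {
    my ($xml) = @_;
    my $handler = WCollector->new;
    XML::SAX::ParserFactory->parser(Handler => $handler)->parse_string($xml);
    return $handler->{rows};
}

# Made-up sample data; the second <w> deliberately has no "wn" anchor.
my $xml = <<'XML';
<e><p>
  <s><w id="1"><a name="wn">wn-1</a><a name="l">lemma-1</a></w></s>
  <w id="2"><a name="l">lemma-2</a></w>
</p></e>
XML

my $rows = collect_rows($xml);
for my $id (sort { $a <=> $b } keys %$rows) {
    my $r = $rows->{$id};
    next unless $r->{l} && $r->{wn};    # skip w elements missing either anchor
    print "@{ $r->{l} }#@{ $r->{wn} }\n";
}
```

For a real file, `parse_uri($file)` would replace `parse_string($xml)`. The handler keeps only the collected text in memory, never the whole tree. This sketch does not handle nested `w` elements, and it accepts any `a` descendant of a `w`, not just direct children.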
Re^2: large XML file in using XPATH by Anonymous Monk on Nov 05, 2009 at 16:18 UTC
I have no idea about SAX, but as far as I know, with XPath the file is read into memory every time the loop runs, so it takes a long time. Also, I can't get all my "wn" and "l" attributes into an array, since some of my w elements don't have the "wn" attribute...
Re: large XML file in using XPATH by mattk1 (Acolyte) on Nov 05, 2009 at 21:02 UTC
Some days ago I parsed a 500KB file with XML::XPath and printed the resulting data structure with print Dumper .. to a file. I was surprised, because the output was 36MB.
Re: large XML file in using XPATH by happy.barney (Friar) on Nov 05, 2009 at 16:29 UTC
As an alternative to XML::XPath, try XML::LibXML (XML::LibXML::XPathContext). Back to your problem, try this (not tested):

    my $xp = XML::XPath->new(filename => $file);
    my $nodelist = $xp->find('//w/a');
    my %map;

    # group every anchor under "<parent id>:<anchor name>"
    while (my $node = $nodelist->shift) {
        my $key = join ':',
            $node->getParentNode->getAttribute('id'),
            $node->getAttribute('name');
        push @{ $map{$key} }, $node;
    }

    for my $n (1 .. 32812) {
        my @wns = @{ $map{$n . ':' . 'wn'} || [] };
        my @ls  = @{ $map{$n . ':' . 'l'}  || [] };
    }
Re^2: large XML file in using XPATH by Anonymous Monk on Nov 05, 2009 at 16:47 UTC
Nope, unfortunately that did not work :(
Re^3: large XML file in using XPATH by happy.barney (Friar) on Nov 06, 2009 at 06:36 UTC
Is it still slow on your XML? Then use XML::LibXML; it is considerably faster than XML::XPath.
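As a rough illustration of what the same grouping idea might look like with XML::LibXML, here is a sketch that runs a single XPath query and groups the anchors by their parent's id. The sample document and the `group_anchors` helper are hypothetical; the element layout is only inferred from the expressions used in this thread.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample data; the second <w> deliberately has no "wn" anchor.
my $xml = <<'XML';
<e><p>
  <s><w id="1"><a name="wn">wn-1</a><a name="l">lemma-1</a></w></s>
  <w id="2"><a name="l">lemma-2</a></w>
</p></e>
XML

# One query over the whole document, then group by parent id and anchor name.
sub group_anchors {
    my ($doc) = @_;
    my %map;    # id => { wn => [text, ...], l => [text, ...] }
    for my $anchor ($doc->findnodes('/e/p//w[@id]/a[@name="wn" or @name="l"]')) {
        my $id   = $anchor->parentNode->getAttribute('id');
        my $name = $anchor->getAttribute('name');
        push @{ $map{$id}{$name} }, $anchor->textContent;
    }
    return \%map;
}

my $doc = XML::LibXML->load_xml(string => $xml);
my $map = group_anchors($doc);

# Numeric sort assumes the ids are numbers, as the 1 .. 32812 loop suggests.
for my $id (sort { $a <=> $b } keys %$map) {
    my $wn = $map->{$id}{wn} or next;    # w elements without "wn" are skipped
    my $l  = $map->{$id}{l}  or next;
    print "@$l#@$wn\n";
}
```

Because the grouping is keyed by id, `w` elements that lack a "wn" anchor simply never get a `wn` entry, so they do not shift the other values out of alignment. For a file on disk, `load_xml(location => $file)` would replace the `string` argument.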
Re: large XML file in using XPATH by mirod (Canon) on Nov 07, 2009 at 07:34 UTC
Your usage of the 'name' attribute in XHTML-ish data is misleading: names should be unique (and the name attribute is deprecated in XHTML). If you have any control over the data, using the 'class' attribute would be cleaner.

Also, XML::XPath is NOT a good module to use. It's slow and, more importantly, it is not actively maintained. As mentioned above, XML::LibXML is a much better option, and the code will be very similar.

That said, here is a solution with XML::Twig that should be easy on the RAM (it purges the in-memory structure after each 'w' element). Note that the code is untested, because you did not give us sample data.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    {
        my $file = $ARGV[0];
        my $wns  = {};    # { <id> => [ <a text>, ... ], ... }
        my $ls   = {};    # same

        XML::Twig->new(
            twig_handlers => {
                q{/e/p//w[@id]/a[@name="wn"]} => sub { add_value( @_, $wns ); },
                q{/e/p//w[@id]/a[@name="l"]}  => sub { add_value( @_, $ls  ); },
                # once you're done with a w element you can get rid of it
                q{/e/p//w} => sub { $_->purge; },
            },
        )->parsefile($file);

        for my $n (1 .. 32812) {
            next unless $ls->{$n} && $wns->{$n};
            print "@{$ls->{$n}}#@{$wns->{$n}}\n";
        }
    }

    # get the id and then add the text of a in the proper array
    sub add_value {
        my( $t, $a, $store ) = @_;
        my $id = $a->parent->id;
        $store->{$id} ||= [];
        push @{$store->{$id}}, $a->text;    # or xml_string if you want embedded tags
    }
Re^2: large XML file in using XPATH by Anonymous Monk on Jun 03, 2013 at 00:37 UTC
Hello, I need a large XML file (about 2MB in size) with queries for a research project. The thing is that I need real data with already existing queries that I can reference in my project. Please guide me to where I can download them. Best, Ankit

