PerlMonks  

large XML file in using XPATH

by Anonymous Monk
on Nov 05, 2009 at 15:31 UTC (#805287)

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have a large XML file, and access to it is very, very slow. How can I speed up the process while still using the XML::XPath module? Please note that some of my "a" elements do not have the "wn" attribute, so this was the only way I could think of ... Thanks.

use strict;
use warnings;
use XML::XPath;

my $file = $ARGV[0];
my $xp   = XML::XPath->new(filename => $file);

for my $n (1 .. 32812) {
    # all a[@name="wn"] children of the w element with this id
    my $wnnodeset = $xp->find('/e/p//w[@id='.$n.']/a[@name="wn"]');
    my @wns;
    if (my @wnnodelist = $wnnodeset->get_nodelist) {
        @wns = map($_->string_value, @wnnodelist);
    }

    # all a[@name="l"] children of the same w element
    my $lnodeset = $xp->find('/e/p//w[@id='.$n.']/a[@name="l"]');
    my @ls;
    if (my @lnodelist = $lnodeset->get_nodelist) {
        @ls = map($_->string_value, @lnodelist);
    }

    print "@ls#@wns\n";
}

Replies are listed 'Best First'.
Re: large XML file in using XPATH
by ikegami (Patriarch) on Nov 05, 2009 at 16:49 UTC

    You have roughly 65,000 queries (two per id), and they even search along the descendant axis (which tends to include lots of nodes). Why don't you extract all the "w" nodes first, then search through those?

    use strict;
    use warnings;
    use XML::LibXML;

    my $file   = $ARGV[0];
    my $parser = XML::LibXML->new();
    my $doc    = $parser->parse_file($file);
    my $root   = $doc->documentElement();

    # one pass over the w elements, then cheap relative queries on each
    for my $w_node ( $root->findnodes('/e/p//w') ) {
        my ($wn_anchor) = $w_node->findnodes('a[@name="wn"]') or next;
        my ($l_anchor)  = $w_node->findnodes('a[@name="l"]' ) or next;
        ...
    }

    Your original sorted the results by id. I didn't replicate that behaviour.
    Your original ignored ids over 32812. I didn't replicate that behaviour.
    Were you expecting more than one anchor with each name? I ignore all but the first.
    Let me know if you want any of the above changed.
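
For reference, the three caveats above can be addressed in one pass. What follows is only a minimal sketch, not ikegami's code: the sample XML is hypothetical (the real layout is known only from the XPath expressions in the question), and `lines_by_id` is a name invented here. It keeps every matching anchor, sorts by id, and prints one "l#wn" line per "w" element found. Unlike the original loop, it prints nothing for ids that have no "w" element and does not cap ids at 32812.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample data; the element layout is an assumption.
my $xml = join "\n",
    '<e><p><s>',
    '  <w id="2"><a name="l">dog</a><a name="wn">n1</a><a name="wn">n2</a></w>',
    '  <w id="1"><a name="l">the</a></w>',
    '</s></p></e>';

# Walk every <w> once, collect all "l" and "wn" anchor texts per id,
# and return one "l-values#wn-values" line per id in numeric id order.
sub lines_by_id {
    my ($doc) = @_;
    my %seen;    # id => { l => [...], wn => [...] }
    for my $w ( $doc->findnodes('/e/p//w[@id]') ) {
        my $id = $w->getAttribute('id');
        for my $name (qw(l wn)) {
            $seen{$id}{$name} ||= [];
            push @{ $seen{$id}{$name} },
                map { $_->textContent } $w->findnodes(qq{a[\@name="$name"]});
        }
    }
    return map {
        my $r = $seen{$_};
        "@{ $r->{l} }#@{ $r->{wn} }";
    } sort { $a <=> $b } keys %seen;
}

my $doc   = XML::LibXML->load_xml( string => $xml );
my @lines = lines_by_id($doc);
print "$_\n" for @lines;
```

For a real file, `XML::LibXML->load_xml( location => $file )` would replace the string form. A "w" element without a "wn" anchor simply yields an empty right-hand side, which matches what the original script printed in that case.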

Re: large XML file in using XPATH
by oha (Friar) on Nov 05, 2009 at 16:11 UTC
    It's hard to answer this question: I have no idea whether the problem is the algorithm or memory swapping. I'll go a bit off-topic and suggest changing your approach completely:

    Use a SAX parser and collect the data while reading the XML only once. This will probably speed up the process and reduce memory usage, and it is a good habit when handling huge XML files.

    HTH
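
To illustrate the read-once idea, here is a hedged sketch. It does not use SAX proper; it swaps in XML::LibXML::Reader, a pull parser, which gives the same single-pass streaming behaviour with less handler boilerplate. The sample XML and the name `stream_lines` are assumptions made for this example, and the sketch matches every "w" element regardless of where it sits under /e/p.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::Reader;

# Hypothetical sample data; the element layout is an assumption.
my $xml = join "\n",
    '<e><p><s>',
    '  <w id="1"><a name="l">the</a></w>',
    '  <w id="2"><a name="l">dog</a><a name="wn">n1</a></w>',
    '</s></p></e>';

# Stream through the document, materialising one <w> subtree at a time.
sub stream_lines {
    my ($reader) = @_;
    my @lines;
    while ( $reader->nextElement('w') == 1 ) {
        my $w = $reader->copyCurrentNode(1);    # deep copy of this <w> only
        my ( @ls, @wns );
        for my $child ( $w->getChildrenByTagName('a') ) {
            my $name = $child->getAttribute('name') // '';
            push @ls,  $child->textContent if $name eq 'l';
            push @wns, $child->textContent if $name eq 'wn';
        }
        push @lines, "@ls#@wns";
    }
    return @lines;
}

my @lines = stream_lines( XML::LibXML::Reader->new( string => $xml ) );
print "$_\n" for @lines;
```

Lines come out in document order rather than id order, so a sort step would still be needed to reproduce the original output exactly. For a file on disk, `XML::LibXML::Reader->new( location => $file )` would be the starting point.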

      No idea about SAX, but as far as I know, with XPath the file is read into memory every time the loop runs, so it takes a long time. Also, I can't collect all my "wn" and "l" attributes into one array, since some of my w elements don't have the attribute "wn" ...
      Some days ago I parsed a 500KB file with XML::XPath and printed the resulting data structure to a file with print Dumper. I was surprised, because the dump was 36MB.
Re: large XML file in using XPATH
by happy.barney (Friar) on Nov 05, 2009 at 16:29 UTC

    As an alternative to XML::XPath, try XML::LibXML (XML::LibXML::XPathContext).

    Back to your problem, try this (not tested):

    my $xp = XML::XPath->new(filename => $file);
    my $nodelist = $xp->find('//w/a');

    # index every a node once, keyed by "<parent id>:<name>"
    my %map;
    while (my $node = $nodelist->shift) {
        my $key = join ':',
            $node->getParentNode->getAttribute('id'),
            $node->getAttribute('name');
        push @{ $map{$key} }, $node;
    }

    for my $n (1 .. 32812) {
        my @wns = @{ $map{$n . ':' . 'wn'} || [] };
        my @ls  = @{ $map{$n . ':' . 'l'}  || [] };
    }
      nope, unfortunately did not work :(
        Is it still slow on your XML? Then use XML::LibXML; it is considerably faster than XML::XPath.
Re: large XML file in using XPATH
by mirod (Canon) on Nov 07, 2009 at 07:34 UTC

    Your usage of the 'name' attribute in XHTML-ish data is misleading: names should be unique (and the name attribute is deprecated in XHTML). If you have any control over the data, using the 'class' attribute would be cleaner.

    Also XML::XPath is NOT a good module to use. It's slow and more importantly, it is not actively maintained. As mentioned above, XML::LibXML is a much better option, and the code will be very similar.

    That said, here is a solution with XML::Twig that should be easy on the RAM (it purges the in-memory structure after each 'w' element). Note that the code is untested, because you did not give us sample data.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    {
        my $file = $ARGV[0];
        my $wns  = {};    # { <id> => [ <a text>, ... ], ... }
        my $ls   = {};    # same

        XML::Twig->new(
            twig_handlers => {
                q{/e/p//w[@id]/a[@name="wn"]} => sub { add_value( @_, $wns ); },
                q{/e/p//w[@id]/a[@name="l"]}  => sub { add_value( @_, $ls  ); },
                # once you're done with a w element you can get rid of it
                q{/e/p//w}                    => sub { $_->flush; },
            },
        )->parsefile($file);

        for my $n (1 .. 32812) {
            next unless $ls->{$n} && $wns->{$n};
            print "@{$ls->{$n}}#@{$wns->{$n}}\n";
        }
    }

    # get the id and then add the text of a in the proper array
    sub add_value {
        my ( $t, $a, $store ) = @_;
        my $id = $a->parent->id;
        $store->{$id} ||= [];
        push @{ $store->{$id} }, $a->text;    # or xml_string if you want embedded tags
    }
      Hello, I need a large xml file (about 2MB size) with queries for a research project. The thing is that I need real data with already existing queries that I can reference in my project. Please guide me where I can download them. Best, Ankit
