PerlMonks

large XML file in using XPATH

by Anonymous Monk
on Nov 05, 2009 at 15:31 UTC (#805287, perlquestion)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have a large XML file, and accessing it the way shown below is very, very slow. How can I speed up the process while using the XML::XPath module? Please note that some of my "w" elements do not have an "a" child with name="wn", so this was the only way I could think of ... Thanks.

    use strict;
    use warnings;
    use XML::XPath;

    my $file = $ARGV[0];
    my $xp   = XML::XPath->new(filename => $file);

    for my $n (1 .. 32812) {
        # all <a name="wn"> children of the <w> with this id (may be none)
        my $wnnodeset = $xp->find('/e/p//w[@id='.$n.']/a[@name="wn"]');
        my @wns;
        if (my @wnnodelist = $wnnodeset->get_nodelist) {
            @wns = map($_->string_value, @wnnodelist);
        }

        # all <a name="l"> children of the same <w>
        my $lnodeset = $xp->find('/e/p//w[@id='.$n.']/a[@name="l"]');
        my @ls;
        if (my @lnodelist = $lnodeset->get_nodelist) {
            @ls = map($_->string_value, @lnodelist);
        }

        print "@ls#@wns\n";
    }

Re: large XML file in using XPATH
by oha (Friar) on Nov 05, 2009 at 16:11 UTC
    It's hard to answer this question: I have no idea whether the issue is the algorithm or memory that is swapping. I'll go a bit OT and suggest that you change your approach completely:

    Use a SAX parser and collect the data while reading the XML only once. This will probably speed up the process and reduce memory usage, and it is a good habit if you are handling huge XML files.
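    A minimal sketch of what such a SAX pass could look like, assuming the XML::SAX distribution is installed. The thread shows no sample data, so the XML string, the WordCollector class name and its wc_* keys are all hypothetical; only the /e/p//w/a layout and the "l" and "wn" names come from the question.

```perl
use strict;
use warnings;

# Hypothetical handler class: remembers the current <w id> and collects
# the text of each <a name="..."> child in a single pass over the input.
package WordCollector;
use parent 'XML::SAX::Base';

sub start_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'w') {
        # un-namespaced attributes are keyed as "{}name" in Perl SAX2
        my $id = $el->{Attributes}{'{}id'};
        $self->{wc_id} = $id ? $id->{Value} : undef;
        $self->{wc_rows}{ $self->{wc_id} } ||= {} if defined $self->{wc_id};
    }
    elsif ($el->{Name} eq 'a' && defined $self->{wc_id}) {
        my $name = $el->{Attributes}{'{}name'};
        $self->{wc_field} = $name ? $name->{Value} : undef;
        $self->{wc_text}  = '';
    }
}

sub characters {
    my ($self, $chars) = @_;
    # text may arrive in several chunks, so append
    $self->{wc_text} .= $chars->{Data} if defined $self->{wc_field};
}

sub end_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'a' && defined $self->{wc_field}) {
        push @{ $self->{wc_rows}{ $self->{wc_id} }{ $self->{wc_field} } },
            $self->{wc_text};
        $self->{wc_field} = undef;
    }
    elsif ($el->{Name} eq 'w') {
        $self->{wc_id} = undef;
    }
}

package main;
use XML::SAX::ParserFactory;

# Made-up sample input; word 2 has no "wn" entry.
my $xml = '<e><p><s>'
        . '<w id="1"><a name="l">dog</a><a name="wn">02084071</a></w>'
        . '<w id="2"><a name="l">the</a></w>'
        . '</s></p></e>';

my $handler = WordCollector->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
$parser->parse_string($xml);    # parse_uri($file) for a file on disk

my $rows = $handler->{wc_rows};
for my $id (sort { $a <=> $b } keys %$rows) {
    my @ls  = @{ $rows->{$id}{l}  || [] };
    my @wns = @{ $rows->{$id}{wn} || [] };
    print "@ls#@wns\n";
}
```

    The collected hash still grows with the number of words, but the document tree itself is never held in memory, and a word without a "wn" entry simply ends up with an empty list.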

    HTH

      No idea about SAX, but I know that in XPath, every time the loop runs the file is read into memory, so it takes a long time. Also, I can't get all my "wn" and "l" values into one array, since some of my w elements don't have a "wn" entry ....

      Some days ago I parsed a 500KB file with XML::XPath and printed the resulting data structure to a file with print Dumper. I was surprised, because the dump was 36MB.
Re: large XML file in using XPATH
by happy.barney (Sexton) on Nov 05, 2009 at 16:29 UTC

    As an alternative to XML::XPath, try XML::LibXML (XML::LibXML::XPathContext).

    Back to your problem, try this (not tested):

    my $xp = XML::XPath->new(filename => $file);
    my $nodelist = $xp->find('//w/a');

    # index every <a> once, keyed by "<w id>:<a name>"
    my %map;
    while (my $node = $nodelist->shift) {
        my $key = join ':',
            $node->getParentNode->getAttribute('id'),
            $node->getAttribute('name');
        push @{ $map{$key} }, $node;
    }

    for my $n (1 .. 32812) {
        my @wns = @{ $map{$n . ':' . 'wn'} || [] };
        my @ls  = @{ $map{$n . ':' . 'l'}  || [] };
    }

      Nope, unfortunately that did not work :(

        Is it still slow on your XML? Then use XML::LibXML; it is quite a bit faster than XML::XPath.
Re: large XML file in using XPATH
by ikegami (Saint) on Nov 05, 2009 at 16:49 UTC

    You run roughly 65,000 queries (two for each of the 32,812 ids), and they even search along the descendant axis (which tends to include lots of nodes). Why don't you extract all the "w" nodes first, then search through those?

    use strict;
    use warnings;
    use XML::LibXML;

    my $file   = $ARGV[0];
    my $parser = XML::LibXML->new();
    my $doc    = $parser->parse_file($file);
    my $root   = $doc->documentElement();

    for my $w_node ( $root->findnodes('/e/p//w') ) {
        my ($wn_anchor) = $w_node->findnodes('a[@name="wn"]') or next;
        my ($l_anchor)  = $w_node->findnodes('a[@name="l"]' ) or next;
        ...
    }

    Your original sorted the results by id. I didn't replicate that behaviour.
    Your original ignored ids over 32812. I didn't replicate that behaviour.
    Were you expecting more than one anchor with each name? I ignore all but the first.
    Let me know if you want any of the above changed.
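    For what it's worth, here is a rough sketch of how those three points could be changed, assuming XML::LibXML 1.70 or later (for load_xml). The sample XML is made up, since the thread has no sample data, and %by_id and @lines are names invented for this illustration. Unlike the original loop, it only prints ids that actually occur in the file.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample data: ids out of order, one word without "wn",
# one word with two "wn" entries.
my $xml = <<'XML';
<e><p><s>
  <w id="2"><a name="l">cat</a></w>
  <w id="1"><a name="l">dog</a><a name="wn">02084071</a><a name="wn">10114209</a></w>
</s></p></e>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# One pass over the <w> elements; keep every matching <a>, not just the first.
my %by_id;
for my $w ($doc->findnodes('/e/p//w[@id]')) {
    my $id = $w->getAttribute('id');
    $by_id{$id} ||= {};    # remember the word even if it has no <a> at all
    for my $anchor ($w->findnodes('a[@name="l" or @name="wn"]')) {
        push @{ $by_id{$id}{ $anchor->getAttribute('name') } },
            $anchor->textContent;
    }
}

# Sort numerically by id and honour the original upper bound of 32812.
my @lines;
for my $id (sort { $a <=> $b } grep { $_ <= 32812 } keys %by_id) {
    my @ls  = @{ $by_id{$id}{l}  || [] };
    my @wns = @{ $by_id{$id}{wn} || [] };
    push @lines, "@ls#@wns";
}
print "$_\n" for @lines;
```

    A word with no "wn" entry keeps its line, with nothing after the "#", which is what the original loop printed for such words.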

Re: large XML file in using XPATH
by mirod (Canon) on Nov 07, 2009 at 07:34 UTC

    Your usage of the 'name' attribute in XHTML-ish data is misleading: names should be unique (and the name attribute is deprecated in XHTML). If you have any control over the data, using the 'class' attribute would be cleaner.

    Also XML::XPath is NOT a good module to use. It's slow and more importantly, it is not actively maintained. As mentioned above, XML::LibXML is a much better option, and the code will be very similar.

    That said, here is a solution with XML::Twig that should be easy on the RAM (it purges the in-memory structure after each 'w' element). Note that the code is untested, because you did not give us sample data.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    {
        my $file = $ARGV[0];
        my $wns  = {};    # { <id> => [ <a text>, ... ], ... }
        my $ls   = {};    # same

        XML::Twig->new(
            twig_handlers => {
                q{/e/p//w[@id]/a[@name="wn"]} => sub { add_value( @_, $wns ); },
                q{/e/p//w[@id]/a[@name="l"]}  => sub { add_value( @_, $ls ); },
                # once you're done with a w element you can get rid of it;
                # purge frees it without printing (flush would also print it)
                q{/e/p//w} => sub { $_[0]->purge; },
            },
        )->parsefile($file);

        for my $n (1 .. 32812) {
            next unless $ls->{$n} && $wns->{$n};
            print "@{$ls->{$n}}#@{$wns->{$n}}\n";
        }
    }

    # get the id and then add the text of a in the proper array
    sub add_value {
        my ( $t, $elt, $store ) = @_;
        my $id = $elt->parent->id;
        $store->{$id} ||= [];
        push @{ $store->{$id} }, $elt->text;    # or xml_string if you want embedded tags
    }