
XML::Twig -vs- XML::XPath

by buttroast (Scribe)
on Oct 13, 2004 at 02:10 UTC (#398752=perlquestion)
buttroast has asked for the wisdom of the Perl Monks concerning the following question:

I am writing a program that reads a large GEDCOM XML 6.0 file and pulls out a specific individual based on an Id parameter passed to the script.

I decided to use XML::Twig because it only stores the selected individual's node in memory, instead of storing the entire XML file in memory. Since XML::Twig does not support the full XPath specification as defined by the W3C, I started looking into using XML::XPath instead of XML::Twig.

My question is: does XML::XPath keep the entire XML file in memory? Or does it work like XML::Twig? Or do I just not understand either of them correctly :-) Also, if there are any better suggestions for reading XML efficiently, please let me know.

My thanks ahead of time...


So, to make sure I understand this, would I be correct in saying both XML::Twig and XML::XPath each read the entire XML file into memory once, with the difference being that XML::Twig gets rid of it as soon as it gets the node(s) it wants to further process?

Thanks buttroast

Re: XML::Twig -vs- XML::XPath
by leriksen (Curate) on Oct 13, 2004 at 03:24 UTC
    As for your request for other (I won't presume better) ways to extract fragments of XML, I use the following approach.

    Say you have a large XML document that contains a lot of Account nodes, only some of which you require.
    I use XML::Parser to go through the XML file once first, using the Subs style. I have subroutines that catch the start and end of just the nodes I require. E.g.

        use XML::Parser;
        ...
        # Subs style: XML::Parser calls a sub named after each element
        # (Account for <Account>, Account_ for </Account>)
        my $p = XML::Parser->new(Style => 'Subs',...);
        my %Accounts;        # account number => { Start => offset, End => offset }
        my $CurrentAccount;  # defined only while inside a required <Account>
        $p->parsefile($xmlfile);

        # Handler for <Account> start element
        sub Account {
            my ($parser, $tag, %attributes) = @_;
            # store the byte-offset into the file of the required node
            if (SomeConditionMet(%attributes)) {
                # remember it so we can use it for the closing tag
                $CurrentAccount = $attributes{Number};
                $Accounts{$CurrentAccount}->{Start} = $parser->current_byte();
            }
        }

        # Handler for </Account> end element
        sub Account_ {
            my ($parser, $tag) = @_;
            # current_byte is the offset in the file to the start
            # of the node. Add the length of what the parser
            # recognised to get the offset to the end of the node

            # is this the closing tag for a required account ?
            if (defined $CurrentAccount) {
                $Accounts{$CurrentAccount}->{End} =
                    $parser->current_byte() + length($parser->recognized_string());
                $CurrentAccount = undef;
            }
        }

        sub SomeConditionMet {
            my (%attributes) = @_;
            # your code goes here
            ...
        }

    This code just stores the start and end offsets of the fragments I really need. If a fragment doesn't satisfy SomeConditionMet(), it is bypassed. Later in the code, when I need to fully parse a required fragment, I have code like this:

        ...
        use XML::Simple qw(:strict);
        use IO::File;   # also exports O_RDONLY and SEEK_SET
        use Carp;
        ...
        my $XMLHandle = IO::File->new($xmlfile, O_RDONLY) or ...

        foreach my $Account (keys %Accounts) {
            # get an in-memory hash representation of the fragment
            my $xml = LoadAccount($Account);
            # process the in-memory hash of this fragment
            # your code goes here
            ...
        }

        sub LoadAccount {
            my ($Account) = @_;
            my $length = $Accounts{$Account}->{End} - $Accounts{$Account}->{Start};
            my ($xml, $xs);

            # I have removed all error handling of the read
            # and seek to simplify this example

            # Jump to the required offset in the file
            $XMLHandle->seek($Accounts{$Account}->{Start}, SEEK_SET);
            $XMLHandle->read($xml, $length);

            eval {
                # parse the XML, using XML::Simple to get the resulting data
                # structure the way you like it
                $xs = XMLin($xml, ForceArray => [qw(...)], KeyAttr => []);
            };
            if ($@) {
                croak("Account $Account :: cannot parse xml $xml - $@");
            }
            return $xs;
        }

    This pretty much emulates what XML::Twig does, but you have pretty fine control of what is happening. It is also pretty fast. I used this technique to process ~4000 2k XML files in < 10 seconds, on my 1.8G RH9 box with Perl 5.8.0.

    You could process the required elements in the closing tag handler of XML::Parser, if you didn't like building the index of offsets first.

    use brain;

Re: XML::Twig -vs- XML::XPath
by mirod (Canon) on Oct 13, 2004 at 15:02 UTC

    XML::XPath indeed builds the DOM for the entire document in memory.

    But if you need all of the power of XPath and the small memory footprint of XML::Twig, maybe there's hope: if XML::XPath and XML::Twig are both installed, you can use... XML::Twig::XPath, which essentially gives you XML::XPath's findnodes and findvalue in XML::Twig. It re-uses the XPath engine of XML::XPath, so you should not have any surprise with it. The Perl Review has an article about it, which should be online at some point.
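    A minimal sketch of that combination, assuming XML::Twig and XML::XPath (or XML::XPathEngine) are installed. The sample document and the element names IndividualRec and IndivName are illustrative assumptions, not taken from the original poster's file. The handler runs once per record, evaluates an XPath expression that uses starts-with() relative to that record, then purges what has been parsed so far.

```perl
use strict;
use warnings;
use XML::Twig::XPath;   # needs XML::XPath or XML::XPathEngine installed

# Hypothetical sample data; element and attribute names are assumptions.
my $xml = <<'XML';
<GEDCOM>
  <IndividualRec Id="I1"><IndivName>Ann Example</IndivName></IndividualRec>
  <IndividualRec Id="I2"><IndivName>Bob Example</IndivName></IndividualRec>
</GEDCOM>
XML

my @names;
my $twig = XML::Twig::XPath->new(
    twig_handlers => {
        IndividualRec => sub {
            my ($t, $elt) = @_;
            # findvalue is evaluated with this element as the context node
            my $name = $elt->findvalue('./IndivName[starts-with(., "Bob")]');
            push @names, "$name" if length "$name";
            # release everything parsed so far
            $t->purge;
        },
    },
);
$twig->parse($xml);     # use parsefile($path) for a file on disk

print "$_\n" for @names;
```

    Because each record is purged after its handler runs, XPath expressions can only see the record currently being handled, not earlier ones.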

Re: XML::Twig -vs- XML::XPath
by leriksen (Curate) on Oct 14, 2004 at 01:34 UTC
    ... would I be correct in saying both XML::Twig and XML::XPath each read the entire XML file into memory once

    Looking at the doc for XML::Twig I would say not. Like the example I gave, XML::Parser is used, because it is event-based (SAX-like) and therefore very kind to memory: only the parts you specify are in memory once the parse is complete.

    From the doco for XML::Twig

    This module provides a way to process XML documents. It is build on top of XML::Parser.

    It allows minimal resource (CPU and memory) usage by building the tree only for the parts of the documents that need actual processing, ...

    As for XML::XPath, its documentation doesn't mention it explicitly, but a quick surf through the source of its internal class XML::XPath::XMLParser shows that it too uses XML::Parser, and it declares handlers for the events of XML::Parser. But it seems to build up an internal tree (of arrayrefs; the author states the reason for this is speed). I see lots of code like $self->{current}->appendChild($node, 1);
    so perhaps it does build an internal image first.

    Why not test with a huge XML doc and watch the memory footprint via ps or top?
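    If watching ps or top by hand is awkward, the same number can be read from inside the script. This is only a rough sketch and it assumes a Linux system, where /proc/self/status has a "VmRSS" line; rss_kb is a hypothetical helper name, and the array allocation is a stand-in for the real parse you want to measure.

```perl
use strict;
use warnings;

# Linux-only assumption: /proc/self/status reports resident memory
# as a line of the form "VmRSS:   12345 kB".
sub rss_kb {
    open my $fh, '<', '/proc/self/status' or return;
    while (my $line = <$fh>) {
        return $1 if $line =~ /^VmRSS:\s+(\d+)\s+kB/;
    }
    return;
}

my $before = rss_kb();

# Stand-in workload: replace this with the XML::Twig or XML::XPath
# parse of your large document.
my @ballast = (1) x 1_000_000;

my $after = rss_kb();

printf "RSS before: %s kB, after: %s kB\n",
    $before // 'n/a', $after // 'n/a';
```

    Run it once per parser against the same file and compare the "after" figures.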

    use brain;

Re: XML::Twig -vs- XML::XPath
by mirod (Canon) on Oct 14, 2004 at 16:45 UTC
    So, to make sure I understand this, would I be correct in saying both XML::Twig and XML::XPath each read the entire XML file into memory once, with the difference being that XML::Twig gets rid of it as soon as it gets the node(s) it wants to further process?

    XML::XPath reads the entire document in memory, but XML::Twig does not necessarily.

    If you just call new with no arguments and then parse an XML document, then it will build the entire tree in memory. But if you use the twig_handlers option when you create the object, then you can call handlers during the parsing, and within those handlers you can call the flush, purge or delete methods to get rid of parts of the tree. You can also use the twig_roots option to process only the elements you need, and not all of the tree.
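    As an illustration of twig_roots combined with purge, here is a small sketch that looks up one individual by Id. The sample document, the element names IndividualRec and IndivName, and the helper find_individual are assumptions made for this example, not part of the original question.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data; element and attribute names are assumptions.
my $xml = <<'XML';
<GEDCOM>
  <IndividualRec Id="I1"><IndivName>Ann Example</IndivName></IndividualRec>
  <IndividualRec Id="I2"><IndivName>Bob Example</IndivName></IndividualRec>
  <FamilyRec Id="F1"/>
</GEDCOM>
XML

# Returns the name of the individual with the given Id, or undef.
sub find_individual {
    my ($xml_string, $wanted_id) = @_;
    my $found;
    my $twig = XML::Twig->new(
        # Only IndividualRec elements are built into twigs;
        # the rest of the document is skipped.
        twig_roots => {
            IndividualRec => sub {
                my ($t, $elt) = @_;
                if (($elt->att('Id') // '') eq $wanted_id) {
                    $found = $elt->first_child_text('IndivName');
                }
                # free the record we have just looked at
                $t->purge;
            },
        },
    );
    $twig->parse($xml_string);   # use parsefile($path) for a file on disk
    return $found;
}

print find_individual($xml, 'I2'), "\n";   # prints: Bob Example
```

    With this setup only one IndividualRec subtree should be held in memory at a time, since each is purged as soon as its handler returns.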

    For more explanations you can have a look at the tutorial.
