http://www.perlmonks.org?node_id=398758


in reply to XML::Twig -vs- XML::XPath

As for your request for other (I won't presume better) ways to extract fragments of XML, I use the following approach.

Say you have a large XML document that contains a lot of Account nodes, only some of which you require.
I use XML::Parser to go through the XML file once first, using the Subs style. I have subroutines that catch the start and end of just the nodes I require. E.g.

    use XML::Parser;
    ...
    my $p = XML::Parser->new(Style => 'Subs',...);
    my %Accounts;         # account number => { Start => offset, End => offset }
    my $CurrentAccount;
    $p->parsefile($xmlfile);

    # Handler for <Account> start element
    sub Account {
        my ($parser, $tag, %attributes) = @_;
        # store the byte offset into the file of the required node
        if (SomeConditionMet(%attributes)) {
            # remember it so we can use it for the closing tag
            $CurrentAccount = $attributes{Number};
            $Accounts{$CurrentAccount}->{Start} = $parser->current_byte();
        }
    }

    # Handler for </Account> end element
    sub Account_ {
        my ($parser, $tag) = @_;
        # current_byte is the offset in the file to the start of what the
        # parser just recognised (here, the closing tag). Add the length of
        # the recognised string to get the offset to the end of the node.
        # Is this the closing tag for a required account?
        if (defined $CurrentAccount) {
            $Accounts{$CurrentAccount}->{End} =
                $parser->current_byte() + length($parser->recognized_string());
            $CurrentAccount = undef;
        }
    }

    sub SomeConditionMet {
        my (%attributes) = @_;
        # your code goes here
        ...
    }

This code only stores the start and end byte offsets of the fragments I really need. A fragment that doesn't satisfy SomeConditionMet() is skipped. Later in the code, when I need to fully parse a required fragment, I have code like this:

    ...
    use XML::Simple qw(:strict);
    use IO::File;    # also exports O_RDONLY and SEEK_SET
    use Carp;
    ...
    my $XMLHandle = IO::File->new($xmlfile, O_RDONLY) or ...

    foreach my $Account (keys %Accounts) {
        # get an in-memory hash representation of the fragment
        my $xml = LoadAccount($Account);
        # process the in-memory hash of this fragment
        # your code goes here
        ...
    }

    sub LoadAccount {
        my ($Account) = @_;
        my $length = $Accounts{$Account}->{End} - $Accounts{$Account}->{Start};
        my ($xml, $xs);

        # I have removed all error handling of the read
        # and seek to simplify this example

        # Jump to the required offset in the file
        $XMLHandle->seek($Accounts{$Account}->{Start}, SEEK_SET);
        $XMLHandle->read($xml, $length);

        eval {
            # parse the XML, using XML::Simple to get the resulting data
            # structure the way you like it
            $xs = XMLin($xml, ForceArray => [qw(...)], KeyAttr => []);
        };
        if ($@) {
            croak("Account $Account :: cannot parse xml $xml - $@");
        }
        return $xs;
    }
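To check the offset arithmetic for yourself, here is a minimal end-to-end sketch of the two steps above. The sample document, the "Number starts with B" condition and the read_fragment() helper are all made up for illustration, and I use plain Start/End handlers instead of the Subs style just to keep it in one piece. The XML::Simple step is left out; the fragment is only read back from the file.

```perl
use strict;
use warnings;
use XML::Parser;
use IO::File;                   # also exports O_RDONLY and SEEK_SET
use File::Temp qw(tempfile);

# Hypothetical sample data, written to a temp file so the offsets are real
# file offsets.
my $doc = join '',
    qq{<Accounts>\n},
    qq{  <Account Number="A1"><Balance>10</Balance></Account>\n},
    qq{  <Account Number="B2"><Balance>20</Balance></Account>\n},
    qq{</Accounts>\n};
my ($tmp, $xmlfile) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} $doc;
close $tmp or die "close failed: $!";

my (%offsets, $current);

my $p = XML::Parser->new(Handlers => {
    Start => sub {
        my ($expat, $tag, %attr) = @_;
        # Stand-in for SomeConditionMet(): keep accounts whose Number starts with B
        return unless $tag eq 'Account' && ($attr{Number} // '') =~ /^B/;
        $current = $attr{Number};
        $offsets{$current}{Start} = $expat->current_byte();
    },
    End => sub {
        my ($expat, $tag) = @_;
        return unless $tag eq 'Account' && defined $current;
        # offset of the closing tag plus its length = end of the node
        $offsets{$current}{End} =
            $expat->current_byte() + length($expat->recognized_string());
        undef $current;
    },
});
$p->parsefile($xmlfile);

# Hypothetical helper: read the bytes between two offsets of a file.
sub read_fragment {
    my ($file, $start, $end) = @_;
    my $fh = IO::File->new($file, O_RDONLY) or die "cannot open $file: $!";
    $fh->seek($start, SEEK_SET) or die "seek failed: $!";
    my $buf;
    my $got = $fh->read($buf, $end - $start);
    die "short read" unless defined $got && $got == $end - $start;
    return $buf;
}

my $fragment = read_fragment($xmlfile, @{ $offsets{B2} }{qw(Start End)});
print "$fragment\n";
```

The fragment read back should be the complete B2 element, from its opening < to the > of its closing tag, which is what you then hand to XMLin(). One thing to keep in mind: the offsets are bytes and length() counts characters, so I would only trust this as written for tag names that are plain ASCII.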

This largely emulates what XML::Twig does, but gives you fine-grained control over what is happening. It is also fast: I used this technique to process about 4000 XML files of roughly 2 KB each in under 10 seconds, on my 1.8G RH9 box with Perl 5.8.0.

You could process the required elements in the closing tag handler of XML::Parser, if you didn't like building the index of offsets first.
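A rough sketch of that variation, assuming the document is already in memory as an undecoded byte string so that substr() can use the byte offsets directly. process_account() and the sample data are made up for illustration; in real code that is where the XMLin() call would go.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample document, held in memory as plain bytes.
my $doc = join '',
    qq{<Accounts>\n},
    qq{  <Account Number="A1"><Balance>10</Balance></Account>\n},
    qq{  <Account Number="B2"><Balance>20</Balance></Account>\n},
    qq{</Accounts>\n};

my @processed;

# Hypothetical stand-in for the real per-account processing.
sub process_account {
    my ($number, $fragment) = @_;
    push @processed, [$number, $fragment];
}

my ($start, $number);

my $p = XML::Parser->new(Handlers => {
    Start => sub {
        my ($expat, $tag, %attr) = @_;
        return unless $tag eq 'Account';
        # remember where this element begins and which account it is
        ($start, $number) = ($expat->current_byte(), $attr{Number});
    },
    End => sub {
        my ($expat, $tag) = @_;
        return unless $tag eq 'Account' && defined $start;
        my $end = $expat->current_byte() + length($expat->recognized_string());
        # handle the fragment right away instead of indexing it for later
        process_account($number, substr($doc, $start, $end - $start));
        undef $start;
    },
});
$p->parse($doc);

printf "%s: %d bytes\n", $_->[0], length $_->[1] for @processed;
```

With parsefile() instead of parse() you would need a second filehandle and the same seek/read as before to get at the bytes, so this shortcut mostly pays off when the document fits in memory anyway.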

use brain;