As for your request for other (I won't presume better) ways to extract fragments of XML, I use the following approach.
Say you have a large XML document that contains a lot of Account nodes, only some of which you require.
I use XML::Parser with the Subs style to go through the XML file once first. I have subroutines that catch the start and end of just the nodes I require, e.g.
use XML::Parser;
...
my $p = XML::Parser->new(Style => 'Subs',...);
my %Accounts;
my $CurrentAccount;
$p->parsefile($xmlfile);
# Handler for <Account> start element
sub Account {
    # the first argument is the XML::Parser::Expat object for this parse;
    # that is the object that provides current_byte()
    my ($p, $tag, %attributes) = @_;
    if (SomeConditionMet(%attributes)) {
        # remember it so we can use it for the closing tag
        $CurrentAccount = $attributes{Number};
        # store the byte-offset into the file of the required node
        $Accounts{$CurrentAccount}->{Start} = $p->current_byte();
    }
}
# Handler for </Account> end element
sub Account_ {
    my ($p, $tag) = @_;
    # is this the closing tag for a required account?
    if (defined $CurrentAccount) {
        # current_byte() is the offset in the file to the start of the
        # closing tag. Add the length of what the parser recognised
        # to get the offset to the end of the node
        $Accounts{$CurrentAccount}->{End} =
            $p->current_byte() + length($p->recognized_string());
        $CurrentAccount = undef;
    }
}
sub SomeConditionMet {
    my (%attributes) = @_;
    # your code goes here
    ...
}
This code just stores the start and end byte offsets of the fragments I really need. If a fragment doesn't satisfy SomeConditionMet(), it is skipped. Later in the code, when I need to fully parse a required fragment, I have code like this:
...
use XML::Simple qw(:strict);
use IO::File;
use Carp;
...
my $XMLHandle = IO::File->new($xmlfile, O_RDONLY) or ...
foreach my $Account (keys %Accounts) {
    # get an in-memory hash representation of the fragment
    my $xml = LoadAccount($Account);
    # process the in-memory hash of this fragment
    # your code goes here
    ...
}
sub LoadAccount {
    my ($Account) = @_;
    my $length = $Accounts{$Account}->{End} -
                 $Accounts{$Account}->{Start};
    my ($xml, $xs);
    # I have removed all error handling of the read
    # and seek to simplify this example
    # Jump to the required offset in the file
    $XMLHandle->seek($Accounts{$Account}->{Start}, SEEK_SET);
    $XMLHandle->read($xml, $length);
    eval {
        # parse the XML, using XML::Simple to get the resulting data
        # structure the way you like it
        $xs = XMLin($xml,
                    ForceArray => [qw(...)],
                    KeyAttr    => []);
    };
    if ($@) {
        croak("Account $Account :: cannot parse xml $xml - $@");
    }
    return $xs;
}
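The seek and read checks I left out of LoadAccount could look something like the following. Treat it as a sketch only: ReadFragment is a name I made up for illustration, and it assumes the handle was opened without any encoding layer, so that read() counts bytes the same way the stored offsets do.

```perl
use strict;
use warnings;
use IO::File;    # also exports O_RDONLY and SEEK_SET
use Carp;

# Hypothetical helper (not part of IO::File): return exactly $length bytes
# starting at byte offset $start of $fh, or croak.
sub ReadFragment {
    my ($fh, $start, $length) = @_;

    $fh->seek($start, SEEK_SET)
        or croak("cannot seek to offset $start: $!");

    my $xml = '';
    my $n = $fh->read($xml, $length);

    # read() returns undef on error and fewer bytes than asked at end of file
    croak("read failed at offset $start: $!") unless defined $n;
    croak("short read at offset $start: wanted $length bytes, got $n")
        unless $n == $length;

    return $xml;
}
```

In LoadAccount the two unchecked calls would then become something like my $xml = ReadFragment($XMLHandle, $Accounts{$Account}->{Start}, $length);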
This pretty much emulates what XML::Twig does, but gives you fine control over what is happening. It is also fast: I used this technique to process ~4000 2k XML files in < 10 seconds on my 1.8G RH9 box with Perl 5.8.0.
You could process the required elements in the closing tag handler of XML::Parser, if you didn't like building the index of offsets first.
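One way that closing-tag variant might look is sketched below. It is my guess at a reasonable shape, not tested against real data: ProcessFile, @Fragments and the Type attribute used in SomeConditionMet are all invented for the example. The idea is to keep only the current start offset, and to re-read the fragment through a second handle on the same file as soon as the closing tag is seen.

```perl
use strict;
use warnings;
use XML::Parser;
use IO::File;    # also exports O_RDONLY and SEEK_SET

my ($Reader, $CurrentAccount, $Start);
my @Fragments;   # demo only: collects [account number, fragment text]

# Parse one file, handling each wanted <Account> as soon as it closes.
sub ProcessFile {
    my ($xmlfile) = @_;
    # second handle on the same file, used only to re-read fragments
    $Reader = IO::File->new($xmlfile, O_RDONLY)
        or die "cannot open $xmlfile: $!";
    XML::Parser->new(Style => 'Subs')->parsefile($xmlfile);
    $Reader->close;
}

# Handler for <Account> start element
sub Account {
    my ($p, $tag, %attributes) = @_;
    return unless SomeConditionMet(%attributes);
    $CurrentAccount = $attributes{Number};
    $Start = $p->current_byte();
}

# Handler for </Account> end element
sub Account_ {
    my ($p, $tag) = @_;
    return unless defined $CurrentAccount;
    my $length = $p->current_byte() + length($p->recognized_string()) - $Start;
    my $xml = '';
    my $n = $Reader->seek($Start, SEEK_SET) ? $Reader->read($xml, $length) : undef;
    if (defined $n && $n == $length) {
        # process the fragment here, e.g. pass $xml to XMLin
        push @Fragments, [$CurrentAccount, $xml];
    }
    else {
        # warn rather than die: as far as I know the Subs style calls
        # handlers inside an eval, so a die here would be silently lost
        warn "could not re-read account $CurrentAccount\n";
    }
    $CurrentAccount = undef;
}

# Placeholder rule (an assumption): keep accounts whose Type is 'wanted'
sub SomeConditionMet {
    my (%attributes) = @_;
    return defined $attributes{Type} && $attributes{Type} eq 'wanted';
}
```

The trade-off is that you lose the index, so you cannot come back to a fragment later without parsing the file again.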