
Re: XML::Twig -vs- XML::XPath

by leriksen (Curate)
on Oct 13, 2004 at 03:24 UTC (#398758)


in reply to XML::Twig -vs- XML::XPath

As for your request for other (I won't presume better) ways to extract fragments of XML, I use the following approach.

Say you have a large XML document that contains a lot of Account nodes, only some of which you require.
I use XML::Parser to go through the XML file once first, using the Subs style. I have subroutines that catch the start and end of just the nodes I require. E.g.

    use XML::Parser;
    ...
    my $p = XML::Parser->new(Style => 'Subs',...);
    my %Accounts;        # account number => { Start => offset, End => offset }
    my $CurrentAccount;

    $p->parsefile($xmlfile);

    # Handler for <Account> start element
    sub Account {
        my ($p, $tag, %attributes) = @_;

        # store the byte-offset into the file of the required node
        if (SomeConditionMet(%attributes)) {
            # remember it so we can use it for the closing tag
            $CurrentAccount = $attributes{Number};
            $Accounts{$CurrentAccount}->{Start} = $p->current_byte();
        }
    }

    # Handler for </Account> end element
    sub Account_ {
        my ($p, $tag) = @_;

        # is this the closing tag for a required account ?
        if (defined $CurrentAccount) {
            # current_byte() is the offset in the file to the start of
            # the closing tag. Add the length of what the parser
            # recognised to get the offset to the end of the node
            $Accounts{$CurrentAccount}->{End} =
                $p->current_byte() + length($p->recognized_string());
            $CurrentAccount = undef;
        }
    }

    sub SomeConditionMet {
        my (%attributes) = @_;
        # your code goes here
        ...
    }

This code just stores the start and end offsets of the fragments I really need. If a fragment doesn't satisfy SomeConditionMet(), it is skipped. Later in the code, when I need to fully parse a required fragment, I have code like this:

    ...
    use XML::Simple qw(:strict);
    use IO::File;
    use Carp;
    ...
    my $XMLHandle = IO::File->new($xmlfile, O_RDONLY) or ...

    foreach my $Account (keys %Accounts) {
        # get an in-memory hash representation of the fragment
        my $xml = LoadAccount($Account);

        # process the in-memory hash of this fragment
        # your code goes here
        ...
    }

    sub LoadAccount {
        my ($Account) = @_;

        my $length = $Accounts{$Account}->{End} - $Accounts{$Account}->{Start};
        my $xml;
        my $xs;

        # I have removed all error handling of the read
        # and seek to simplify this example

        # Jump to the required offset in the file
        $XMLHandle->seek($Accounts{$Account}->{Start}, SEEK_SET);
        $XMLHandle->read($xml, $length);

        eval {
            # parse the XML, using XML::Simple to get the resulting data
            # structure the way you like it
            $xs = XMLin($xml, ForceArray => [qw(...)], KeyAttr => []);
        };
        if ($@) {
            croak("Account $Account :: cannot parse xml $xml - $@");
        }
        return $xs;
    }

This pretty much emulates what XML::Twig does, but gives you fine control over what is happening. It is also fast: I used this technique to process ~4000 XML files of about 2k each in under 10 seconds, on my 1.8G RH9 box with Perl 5.8.0.
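
For comparison, here is a minimal sketch of how the same selective extraction might look with XML::Twig. The sample document, the Status attribute and the Balance child element are made up for illustration; only the Account element and its Number attribute come from the code above.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data: three accounts, two of them "open".
my $xml = <<'XML';
<Accounts>
  <Account Number="1001" Status="open"><Balance>10</Balance></Account>
  <Account Number="1002" Status="closed"><Balance>0</Balance></Account>
  <Account Number="1003" Status="open"><Balance>25</Balance></Account>
</Accounts>
XML

my %wanted;    # account number => balance text

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once each Account element has been completely parsed
        Account => sub {
            my ($t, $elt) = @_;
            # Stand-in for SomeConditionMet(): keep only open accounts
            if (($elt->att('Status') // '') eq 'open') {
                $wanted{ $elt->att('Number') } = $elt->first_child_text('Balance');
            }
            # Free the memory used by everything parsed so far
            $t->purge;
        },
    },
);
$twig->parse($xml);
```

The trade-off is that XML::Twig builds a tree for every Account element before the handler can reject it, whereas the offset index only fully parses the fragments that passed the test.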

You could process the required elements in the closing tag handler of XML::Parser, if you didn't like building the index of offsets first.
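
A rough sketch of that variant, assuming a small made-up input file and a hypothetical Status attribute as the selection test. The start handler remembers the offset, and the end handler reads the fragment back through a second filehandle straight away instead of storing an End offset for later.

```perl
use strict;
use warnings;
use XML::Parser;
use File::Temp qw(tempfile);

# Hypothetical sample file: three accounts, two of them "open".
my ($fh, $xmlfile) = tempfile(SUFFIX => '.xml', UNLINK => 1);
print {$fh} <<'XML';
<Accounts>
  <Account Number="1001" Status="open"><Balance>10</Balance></Account>
  <Account Number="1002" Status="closed"><Balance>0</Balance></Account>
  <Account Number="1003" Status="open"><Balance>25</Balance></Account>
</Accounts>
XML
close $fh or die "close $xmlfile: $!";

# Second handle on the same file, used only to read fragments back
open my $raw, '<:raw', $xmlfile or die "open $xmlfile: $!";

my ($CurrentAccount, $StartOffset);
my %Fragments;    # account number => raw XML text of that node

my $p = XML::Parser->new(Style => 'Subs');
$p->parsefile($xmlfile);

# Handler for <Account> start element: remember where the node begins
sub Account {
    my ($expat, $tag, %attributes) = @_;
    return unless SomeConditionMet(%attributes);
    $CurrentAccount = $attributes{Number};
    $StartOffset    = $expat->current_byte();
}

# Handler for </Account> end element: read the fragment immediately
sub Account_ {
    my ($expat, $tag) = @_;
    return unless defined $CurrentAccount;
    my $end = $expat->current_byte() + length($expat->recognized_string());
    # Error handling of seek and read left out to keep the sketch short
    seek($raw, $StartOffset, 0);
    read($raw, my $xml, $end - $StartOffset);
    $Fragments{$CurrentAccount} = $xml;    # or pass $xml to XMLin() here
    $CurrentAccount = undef;
}

# Stand-in selection test; Status is an assumed attribute
sub SomeConditionMet {
    my (%attributes) = @_;
    return ($attributes{Status} // '') eq 'open';
}
```

Here %Fragments simply collects the raw text; in real code the call to XMLin() and the per-account processing would go where the fragment is stored.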

use brain;

