PerlMonks
parse XML huge file using cpan modules

by nicopelle (Acolyte)
on Jul 28, 2019 at 12:04 UTC ( #11103530=perlquestion )

nicopelle has asked for the wisdom of the Perl Monks concerning the following question:

Hi to all PerlMonks users, and thanks to everyone, in particular to whoever will spend some minutes on my silly question.

I'm really losing my mind with a file formatted like this:

<?xml version="1.0" encoding="UTF-8"?>
<ctgStatistics xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:noNamespaceSchemaLocation="ctgstatslog.xsd">
  <statRecord type="interval" length="60" time="2019-07-16T08:23:59">
    <resourceGroup name="CSCS1SVGM1">
      <statistic type="Startup"> <name>SPROTOCOL</name> <value type="String">TCPIP</value> </statistic>
      <statistic type="Lifetime"> <name>LCONNFAIL</name> <value type="Integer">0</value> </statistic>
      <statistic type="Lifetime"> <name>LLOSTCONN</name> <value type="Integer">0</value> </statistic>
      <statistic type="Lifetime"> <name>LIDLETIMEOUT</name> <value type="Integer">0</value> </statistic>
      <statistic type="Lifetime"> <name>LALLREQ</name> <value type="Integer">0</value> </statistic>
      <statistic type="Lifetime"> <name>LREQDATA</name> <value type="Long">0</value> </statistic>
      <statistic type="Lifetime"> <name>LRESPDATA</name> <value type="Long">0</value> </statistic>
      <statistic type="Current"> <name>CREQCURR</name> <value type="Integer">0</value> </statistic>
      <statistic type="Current"> <name>CWAITING</name> <value type="Integer">0</value> </statistic>
      <statistic type="Current"> <name>CORPHANREQ</name> <value type="Integer">0</value> </statistic>
      <statistic type="Lifetime"> <name>LORPHANREQ</name> <value type="Integer">0</value> </statistic>
      <statistic type="Current"> <name>CTERM</name> <value type="Integer">0</value> </statistic>
      <statistic type="Lifetime"> <name>LTERMINST</name> <value type="Integer">0</value> </statistic>
      <statistic type="Lifetime"> <name>LTERMUNINST</name> <value type="Integer">0</value> </statistic>
      <statistic type="Lifetime"> <name>LAVRESP</name> <value type="Integer">0</value> </statistic>
      <statistic type="Startup"> <name>SIPADDR</name> <value type="String">amlif700txs001</value> </statistic>
      <statistic type="Startup"> <name>SIPPORT</name> <value type="Integer">28218</value> </statistic>
    </resourceGroup>
    <resourceGroup name="CSCS1SVGM1">
      .....
    </resourceGroup>
    <resourceGroup name="ANOTHER_ONE_AND_SO_ON">
      ...
    </resourceGroup>
  </statRecord>
</ctgStatistics>

The Data::Dumper output is:

$VAR1 = {
  'xsi:noNamespaceSchemaLocation' => 'ctgstatslog.xsd',
  'statRecord' => [
    {
      'resourceGroup' => {
        'CSCS1GFVM1' => {
          'statistic' => {
            'LRESPDATA' => {
              'value' => { 'content' => '0', 'type' => 'Long' },
              'type'  => 'Lifetime'
            },
            'CREQCURR' => {
              'type'  => 'Current',
              'value' => { 'type' => 'Integer', 'content' => '0' }
            },
            'SIPADDR' => {
              'value' => { 'type' => 'String', 'content' => 'amlif700txs001' },
              'type'  => 'Startup'
            },
            'LLOSTCONN' => {
              'type'  => 'Lifetime',
              'value' => { 'type' => 'Integer', 'content' => '0' }
            },
            'CORPHANREQ' => {
              'value' => { 'content' => '0', 'type' => 'Integer'

This file contains 24 hours of statistics, and its size is 82 MB (one file per day).
There is a section:

<statRecord type="interval" length="60" time="2019-07-16T08:23:59">
every 60 seconds.
The output I need would look more or less like this:

time|resourceGroup name|LCONNFAIL|LLOSTCONN|LIDLETIMEOUT|SIPADDR|SIPPORT
2019-07-16T08:23:59|CSCS1SVGM1|0|0|0|amlif700txs001|28218
...and so on until the end of the file.

I've already tried XML::Twig, XML::LibXML, and Data::Dumper with XML::Simple (which is not so simple for me ;-) ), with no success :(.
Thanks again to anyone who wants to show me the right path to follow.

Replies are listed 'Best First'.
Re: parse XML huge file using cpan modules
by choroba (Archbishop) on Jul 28, 2019 at 19:48 UTC
    For really large files, you can use XML::LibXML::Reader.
    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };

    use XML::LibXML::Reader;

    my $r = 'XML::LibXML::Reader'->new(location => shift);
    say 'time|resourceGroup|LCONNFAIL|LLOSTCONN|LIDLETIMEOUT|SIPADDR|SIPPORT';

    my %dispatch = (
        statRecord    => sub { print $r->getAttribute('time'), '|'; },
        resourceGroup => sub { print $r->getAttribute('name'); },
        name          => sub {
            my $name = $r->readInnerXml;
            return unless $name =~ /^(?:LCONNFAIL|LLOSTCONN|LIDLETIMEOUT|SIPADDR|SIPPORT)$/;

            $r->nextSiblingElement('value');
            print '|', $r->readInnerXml;
            print "\n" if $name eq 'SIPPORT';
        },
    );

    while ($r->read) {
        next unless $r->nodeType == 1;
        my $action = $dispatch{ $r->name };
        $action->() if $action;
    }

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      @choroba, your code works flawlessly too! Thanks for your support, and not least for your teachings.
Re: parse XML huge file using cpan modules
by poj (Abbot) on Jul 28, 2019 at 16:10 UTC

    Try building a hash from the name/value elements within each resourceGroup. Example using XML::Twig:

    #!/usr/bin/perl
    use strict;
    use XML::Twig;

    #time|resourceGroup name|LCONNFAIL|LLOSTCONN|LIDLETIMEOUT|SIPADDR|SIPPORT
    my @columns = ('resourceGroup',
        'LCONNFAIL', 'LLOSTCONN', 'LIDLETIMEOUT', 'SIPADDR', 'SIPPORT');
    my $time;
    my %record = ();

    # tag handlers
    my $twig = XML::Twig->new(
        twig_handlers => {
            resourceGroup => \&resourceGroup,
            statistic     => \&statistic,
        },
        start_tag_handlers => {
            statRecord => sub { $time = $_[1]->att('time') }
        }
    );

    # process file
    print join '|', 'time', @columns;
    print "\n";
    $twig->parsefile('my_file.xml');

    sub resourceGroup {
        my ($t, $e) = @_;
        $record{'resourceGroup'} = $e->att('name');
        # print record
        print join "|", $time, map { $record{$_} } @columns;
        print "\n";
        $t->purge;
        %record = ();
    }

    sub statistic {
        my ($t, $e) = @_;
        my $name = $e->first_child_text('name');
        $record{$name} = $e->first_child_text('value');
    }
    poj
      @poj, your code works flawlessly!
      Thanks for your support!
      And above all, thanks for teaching me how to interact correctly with the XML file.
Re: parse XML huge file using cpan modules
by Corion (Pope) on Jul 28, 2019 at 13:23 UTC

    How did your approaches fail? You already managed to read in the data, as your output using Data::Dumper shows, so where is your problem?

    Is your problem to output the CSV / pipe delimited data?

    Can you please edit your post and add the code you already have, and explain where exactly your problem is?

      Hi Corion, thanks for your fast reply. My issue is accessing the data fields. I'm not able to print these values:
      2019-07-16T08:23:59|CSCS1SVGM1|0|0|0|amlif700txs001|28218
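Once the values for one resourceGroup have been collected into a hash, producing that pipe-delimited line is a plain join. A minimal core-Perl sketch, with the sample values from the post hard-coded as placeholders:

```perl
use strict;
use warnings;

# Columns in the order the report needs them
my @columns = qw(LCONNFAIL LLOSTCONN LIDLETIMEOUT SIPADDR SIPPORT);

# Hypothetical record, filled with the values from the sample XML
my %record = (
    LCONNFAIL    => 0,
    LLOSTCONN    => 0,
    LIDLETIMEOUT => 0,
    SIPADDR      => 'amlif700txs001',
    SIPPORT      => 28218,
);

# Join the time, the group name, and the statistics with '|'
my $line = join '|', '2019-07-16T08:23:59', 'CSCS1SVGM1',
    map { $record{$_} // '' } @columns;

print "$line\n";   # 2019-07-16T08:23:59|CSCS1SVGM1|0|0|0|amlif700txs001|28218
```

The actual parsing (filling %record per group) is the part the replies below address; this only shows the output step.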

        Again, please show us the code you already have, and also explain to us where each value comes from.

        Without seeing the code you already have, it is very hard to give you concrete advice on how to change your code.

Re: parse XML huge file using cpan modules
by BillKSmith (Prior) on Jul 28, 2019 at 15:30 UTC
    I am confused. I first assumed that the data you show is a 60-second sample extracted from a 24-hour file. However, the size of this sample times 60 minutes times 24 hours is not even close to the 82 MB which you specify. The expected output that you posted appears to be a header line and a summary of the data sample given. The first problem that I see is that much of that data is not available in the hash that you posted.
    Bill
      Thanks for your time, Bill. As explained above, poj's solution is really fine and works flawlessly.
        I understand that poj showed you how to parse the file. You still have the discrepancy in the file size. Are you sure that this is not a symptom of another problem?
        Bill
Re: parse XML huge file using cpan modules
by Jenda (Abbot) on Jul 29, 2019 at 12:12 UTC

    To add to the list of options...

    use strict;
    use XML::Rules;
    use Data::Dumper qw(Dumper);

    my $parser = XML::Rules->new(
        stripspaces => 15,
        rules => {
            'name,value' => 'content',
            statistic => sub {
                return '%' . $_[1]->{type} => { $_[1]->{name} => $_[1]->{value} }
            },
            resourceGroup => 'no content array',
            statRecord => sub {
                #print Dumper($_[1]);
                foreach my $group (@{$_[1]->{resourceGroup}}) {
                    print "$_[1]->{time}|$group->{name}|$group->{Lifetime}{LCONNFAIL}|$group->{Lifetime}{LLOSTCONN}|$group->{Lifetime}{LIDLETIMEOUT}|$group->{Startup}{SIPADDR}|$group->{Startup}{SIPPORT}\n";
                }
                return;
            }
        }
    );
    print "time|resourceGroup name|LCONNFAIL|LLOSTCONN|LIDLETIMEOUT|SIPADDR|SIPPORT\n";
    $parser->parse(\*DATA);
    __DATA__
    <?xml version="1.0" encoding="UTF-8"?>
    <ctgStatistics xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:noNamespaceSchemaLocation="ctgstatslog.xsd">
    <statRecord type="interval" length="60" time="2019-07-16T08:23:59">
    <resourceGroup name="CSCS1SVGM1">
    <statistic type="Startup">
    ...
    or
    use strict;
    use XML::Rules;
    use Data::Dumper qw(Dumper);

    my $parser = XML::Rules->new(
        stripspaces => 15,
        rules => {
            'name,value' => 'content',
            statistic => sub { return $_[1]->{name} => $_[1]->{value} },
            resourceGroup => 'no content array',
            statRecord => sub {
                #print Dumper($_[1]);
                foreach my $group (@{$_[1]->{resourceGroup}}) {
                    print "$_[1]->{time}|$group->{name}|$group->{LCONNFAIL}|$group->{LLOSTCONN}|$group->{LIDLETIMEOUT}|$group->{SIPADDR}|$group->{SIPPORT}\n";
                }
                return;
            }
        }
    );
    print "time|resourceGroup name|LCONNFAIL|LLOSTCONN|LIDLETIMEOUT|SIPADDR|SIPPORT\n";
    $parser->parse(\*DATA);
    __DATA__
    <?xml version="1.0" encoding="UTF-8"?>
    <ctgStatistics xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:noNamespaceSchemaLocation="ctgstatslog.xsd">
    <statRecord type="interval" length="60" time="2019-07-16T08:23:59">
    <resourceGroup name="CSCS1SVGM1">
    <statistic type="Startup">
    ...

    The first version preserves the statistics type in the data provided to the handler of the statRecord tag, the second assumes there will be no duplicate names of statistics and ignores the types.

    There's only the data from one <statRecord> in memory at any time.

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

      Thanks Jenda!!
Re: parse XML huge file using cpan modules
by Anonymous Monk on Jul 30, 2019 at 17:44 UTC
    There are generally two ways to handle XML: "LibXML," which uses an industry-standard binary library to turn the XML into an in-memory data structure, and "Twig," which walks through the data invoking subroutines along the way (but without reading it all into memory). Both solutions are known to work correctly with any XML data. If you're processing a terabyte XML file with gigabytes of memory, which sometimes happens, use Twig. It can do it.
      XML::LibXML can do it as well, as was shown here in this very thread. XML::LibXML::Reader works similarly to XML::Twig: it doesn't keep the whole data in memory, but if you tell it to, it can "inflate" a part of the data into a full-featured XML::LibXML object you can process using all the available methods.

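A minimal sketch of that "inflate" technique, assuming the file layout from the original post (the file name ctgstats.xml and the XPath expression are placeholders): when the reader is positioned on a resourceGroup start tag, copyCurrentNode(1) returns a deep copy as a regular XML::LibXML element that you can query with XPath.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# 'ctgstats.xml' is a placeholder file name
my $r = XML::LibXML::Reader->new(location => 'ctgstats.xml');

my $time = '';
while ($r->read) {
    next unless $r->nodeType == XML_READER_TYPE_ELEMENT;
    if ($r->name eq 'statRecord') {
        $time = $r->getAttribute('time');
    }
    elsif ($r->name eq 'resourceGroup') {
        # "Inflate" just this subtree into a real XML::LibXML element;
        # only one resourceGroup's worth of data is held at a time
        my $group = $r->copyCurrentNode(1);
        # ...and query the copy with the full XPath API
        my $addr = $group->findvalue('./statistic[name="SIPADDR"]/value');
        print join('|', $time, $group->getAttribute('name'), $addr), "\n";
    }
}
```

Subsequent read() calls walk into the already-copied subtree again; that is harmless here because none of the inner tags match the two names the loop dispatches on.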

      Well, yes but no.

      There are more ways, and often your hailed "industry-standard binary libraries" support several of them.

      You can use one of several libraries to load the whole file into memory as a huge maze of objects and then search and navigate the maze using methods and sublanguages like XPath.

      You can use one of several libraries to load the whole file into memory as a huge memory structure (possibly with a bit of tie() magic) and navigate it using normal Perl tools. You should NOT use XML::Simple for that 'cause it produces inconsistent data structures! If the data structure is your goal, then have a look at XML::Rules, it would allow you to produce a consistent structure and trim it along the way.
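The inconsistency in question is easy to demonstrate. A small sketch with XML::Simple's default options: one child element comes back as a plain scalar, two come back as an array reference, unless you pin the shape down with ForceArray.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# One <item> child: XML::Simple hands back a plain scalar value
my $one = XMLin('<root><item>a</item></root>');
print ref($one->{item})   || 'plain scalar', "\n";   # plain scalar

# Two <item> children: suddenly the same key holds an array reference
my $many = XMLin('<root><item>a</item><item>b</item></root>');
print ref($many->{item})  || 'plain scalar', "\n";   # ARRAY

# ForceArray makes the shape predictable regardless of element count
my $fixed = XMLin('<root><item>a</item></root>', ForceArray => ['item']);
print ref($fixed->{item}) || 'plain scalar', "\n";   # ARRAY
```

Any code consuming such a structure has to special-case both shapes, which is exactly the complaint above.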

      You can use one of several libraries to have them call your handlers whenever they find another bit of whatever in the XML and take care of knowing where the heck you are in the structure yourself. Good luck with that! Industry standard or no industry standard. It's a mess.

      You can use one of several libraries to give you the next bit whenever you ask for it and take care of knowing where the heck you are in the structure yourself. Good luck with that! Industry standard or no industry standard. It's a mess.

      You can use XML::Twig to call your handler whenever it finishes parsing a reasonably large, easy to digest chunk of the XML (a twig) and have it provide you with the data from the twig either as a maze of objects or a data structure.

      You can use XML::Rules to call your handler whenever it finishes parsing a reasonably large, easy-to-digest chunk of the XML, and have it provide you with the data from the chunk as a data structure built according to the rules you provided. You can then handle or massage the data in any way you need and have the result made available to the handler of an enclosing chunk, and thus either process the file as you go or build a modified, trimmed-down data structure.

      Jenda

      Thanks for your tips.

Node Type: perlquestion [id://11103530]
Approved by haukex