PerlMonks  

Re^2: How to read compressed (gz) file in xml::twig

by CSharma (Sexton)
on Feb 20, 2017 at 12:04 UTC ( [id://1182342] )


in reply to Re: How to read compressed (gz) file in xml::twig
in thread How to read compressed (gz) file in xml::twig

Thanks Hauke for the solution! That works, but the script takes a long time to produce its output. Can this be reduced? The gzipped file is 45MB and the root has 158K children (Offer) in total.

Also, my second question wasn't related to the first; it was a separate issue. Here is the code snippet:
my $file = 'Offerfeed_11742413_uk.full.xml.gz';
my $z = IO::Uncompress::Gunzip->new($file)
    or die "gunzip failed: $IO::Uncompress::Gunzip::GunzipError\n";
my $twig = new XML::Twig;          ## Get twig object
$twig->parse($z);                  ## parse the file to build twig
my $root = $twig->root;            ## Get the root element of twig
my @elements = $root->children;    ## Get elements list of twig
my $ct = 0;
foreach my $e (sort @elements){
    my $cpc = ($e->first_child('EstimatedCPC')->text)*100;
    print $cpc,"\n";
    $ct++;
}
print $ct,"\n";

Re^3: How to read compressed (gz) file in xml::twig
by marto (Cardinal) on Feb 20, 2017 at 12:11 UTC

    A 45MB gzipped file is going to be a lot of data once unzipped; take a look at some of these file sizes, XML vs gzip. Either profile your code to see if improvements can be made (see the documentation for advice on huge documents), or invest in a faster CPU, faster disks, much more RAM...

Re^3: How to read compressed (gz) file in xml::twig
by haukex (Archbishop) on Feb 20, 2017 at 16:07 UTC

    Hi CSharma,

    but the script takes a long time to produce its output

    How long does it take to gunzip the file and then process it with your existing script? How much longer does the above code take? To get a somewhat decent comparison, try piping the output of gunzip into your script (it'll need a slight modification to read from STDIN).
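For the STDIN comparison suggested above, a minimal sketch of the "slight modification" might look like the following. It assumes the same feed layout as the original snippet (every child of the root has an EstimatedCPC element); the sub name print_cpc and the file name cpc_stdin.pl are made up for illustration.

```perl
use warnings;
use strict;
use XML::Twig;

# Parse XML from an already-open filehandle, print each child's
# EstimatedCPC times 100, then print and return the number of children.
sub print_cpc {
    my ($fh) = @_;
    my $twig = XML::Twig->new;
    $twig->parse($fh);    # parse() accepts an open filehandle as well as a string
    my $ct = 0;
    for my $e ($twig->root->children) {
        print $e->first_child('EstimatedCPC')->text * 100, "\n";
        $ct++;
    }
    print $ct, "\n";
    return $ct;
}

# Only parse when something is actually piped in, for example (in bash):
#   time gunzip -c Offerfeed_11742413_uk.full.xml.gz | perl cpc_stdin.pl
print_cpc(\*STDIN) unless -t STDIN or eof(STDIN);
```

Timing that pipeline against the IO::Uncompress::Gunzip version should give a rough idea of how much of the run time is decompression and how much is XML parsing.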

    One thing that might* speed things up is if you make use of XML::Twig's ability to parse an XML file in chunks, instead of reading the whole thing into memory like you're currently doing.

    use warnings;
    use strict;
    use IO::Uncompress::Gunzip ();
    use XML::Twig;

    my $z = IO::Uncompress::Gunzip->new('in.xml.gz')
        or die "gunzip failed: $IO::Uncompress::Gunzip::GunzipError\n";
    my $twig = XML::Twig->new(
        twig_roots => {
            '/CatalogListings/Offer/EstimatedCPC' => sub {
                my ($t, $elt) = @_;
                print $elt->text*100, "\n";
                $t->purge;
            },
        },
    );
    $twig->parse($z);
    $z->close;

    This produces the same output as before, but discards each <EstimatedCPC> element when it's done processing it, and ignores the other elements.

    (* The code works, but I haven't had the chance to do a performance test.)
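Since no performance test was done, here is one way a rough comparison could be set up. This is only a sketch: the feed is synthetic and built in memory (the Title element and the count of 5,000 offers are invented), the sub names full_tree and chunked are hypothetical, and timings will vary with the machine and the real data, so treat the numbers as a starting point rather than a verdict.

```perl
use warnings;
use strict;
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);
use Time::HiRes qw(time);
use XML::Twig;

# Build a synthetic gzipped feed in memory (made-up data, not the real file).
my $n   = 5_000;
my $xml = '<CatalogListings>'
    . join('', map { "<Offer><Title>Item $_</Title><EstimatedCPC>0.25</EstimatedCPC></Offer>" } 1 .. $n)
    . '</CatalogListings>';
my $gz;
gzip(\$xml => \$gz) or die "gzip failed: $GzipError\n";

# Approach 1: build the whole tree in memory, then walk the root's children.
sub full_tree {
    my $z = IO::Uncompress::Gunzip->new(\$gz)
        or die "gunzip failed: $GunzipError\n";
    my $twig = XML::Twig->new;
    $twig->parse($z);
    my $sum = 0;
    $sum += $_->first_child('EstimatedCPC')->text * 100 for $twig->root->children;
    return $sum;
}

# Approach 2: only build EstimatedCPC elements and purge each one after use.
sub chunked {
    my $z = IO::Uncompress::Gunzip->new(\$gz)
        or die "gunzip failed: $GunzipError\n";
    my $sum  = 0;
    my $twig = XML::Twig->new(
        twig_roots => {
            '/CatalogListings/Offer/EstimatedCPC' => sub {
                my ($t, $elt) = @_;
                $sum += $elt->text * 100;
                $t->purge;
            },
        },
    );
    $twig->parse($z);
    return $sum;
}

# Time each approach once and print the sum so the results can be compared.
for my $pair ([full_tree => \&full_tree], [chunked => \&chunked]) {
    my ($name, $code) = @$pair;
    my $t0  = time;
    my $sum = $code->();
    printf "%-10s sum=%d  %.2fs\n", $name, $sum, time - $t0;
}
```

Both subs sum the values instead of printing them so that terminal output does not dominate the timing; checking that the two sums agree is a cheap sanity test that both approaches see the same elements.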

    Hope this helps,
    -- Hauke D

      Thanks a lot Hauke!! The code worked and it's certainly better than earlier. Thanks, Chetan
