PerlMonks  

Parsing a large 80GB .gz file and making an output file with specific columns from this original file.

by pillaipraveen (Initiate)
on Jul 16, 2013 at 11:03 UTC

pillaipraveen has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a large .gz file, 80GB in size, to read. Currently I am using IO::Uncompress::Gunzip to read this .gz file.
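
The handle is opened with IO::Uncompress::Gunzip roughly like this (a minimal sketch; the file name below is just a placeholder):

use IO::Uncompress::Gunzip qw($GunzipError);

# Open the compressed file for line-by-line reading via getline().
my $INTENSITY_FILE = IO::Uncompress::Gunzip->new("intensities.tsv.gz")
    or die "Could not open gzip stream: $GunzipError\n";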

Each line of the file contains 27,000 entries, and I need only particular columns from this source file in the output file. I am currently parsing every line of the file with this code:

while (defined(my $intensities = $INTENSITY_FILE->getline())) {
    my @intensities = split(/\t/, $intensities);
    @final_array = ();
    foreach ($index = 3; $index < scalar(@intensities); $index += 3) {
        push(@index_intensities, ($index + 1, $index + 2));
    }
    push(@final_array, (@intensities[@index_intensities]));
    print OUTFILE join("\t", @final_array);
}
This does the task, and I am getting an OUTFILE with only the columns I need from the source file. I am looking for another method that could speed up this parsing and write the output to the new file faster. Any comments will be greatly appreciated. Thanks in advance, Praveen.


Replies are listed 'Best First'.
Re: Parsing a large 80GB .gz file and making an output file with specific columns from this original file.
by Corion (Patriarch) on Jul 16, 2013 at 11:32 UTC

    My approach is usually to parallelize the decompression and the string handling by using the two-argument form of open:

    my $filename= "some.file.gz"; my $cmd= "gunzip -cd '$filename' |"; open my $fh, $cmd or die "Couldn't decompress '$filename' via [$cmd]: $! / $?";

    This piped approach also works nicely with transfers over ssh connections.
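
    For example (a sketch; the host name and remote path below are made up):

    my $filename = '/data/big.tsv.gz';
    my $cmd = "ssh somehost gunzip -cd '$filename' |";
    open my $fh, $cmd
        or die "Couldn't decompress '$filename' via [$cmd]: $! / $?";
    # The remote gunzip streams the decompressed lines back over the
    # ssh connection, and $fh reads them like any local filehandle.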

      why would you want to use the two-argument form when you can use the five-argument form?
      open my $fh, '-|', 'gunzip', '-cd', $filename or die "Couldn't decompress '$filename' via gunzip: $! / $?";

        I use the two-argument form because the list form of pipe open fails for me on Windows, unfortunately:

        >perl -we "my $fn= shift; open fh, '|-', 'gunzip', '-cd', $fh or warn +$!" foo.gz List form of pipe open not implemented at -e line 1.
Re: Parsing a large 80GB .gz file and making an output file with specific columns from this original file.
by BrowserUk (Patriarch) on Jul 16, 2013 at 14:30 UTC

    I think Corion has the right idea, but I'd take it one step further and avoid having the perl script write directly to disk.

    I'd do it this way:

    gunzip -cd big.tsv.gz | perl -ne"@v=split chr(9),$_; $i=-1; print $v[$i+=3], chr(9) while $i < @v; print chr(10)" | perl -pe1 > newfile

    NB: That's all one command line; it may wrap in your display.

    The idea is to

    1. Minimise the memory juggling inside the first perl script.

      It avoids building a second array of the columns you are keeping, and then allocating yet more memory in order to do the join.

    2. Use the pipe buffers between the first perl and the second to buffer the IO and prevent some disk contention.

    If you are going to be doing this regularly rather than just as a one-off, it might be worth trying both ways to see which works best on your system.
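
    For comparison, here is a rough script-form sketch of the first perl stage (it does the same every-third-column selection as the one-liner, starting at index 2; adjust the index arithmetic to the columns you actually need):

    #!/usr/bin/perl
    # Sketch: read tab-separated lines on STDIN, keep every third column
    # starting at index 2, and write the reduced lines to STDOUT.
    use strict;
    use warnings;

    while (my $line = <STDIN>) {
        chomp $line;
        my @v = split /\t/, $line, -1;      # -1 keeps trailing empty fields
        my @keep;
        for (my $i = 2; $i < @v; $i += 3) {
            push @keep, $v[$i];
        }
        print join("\t", @keep), "\n";
    }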


Re: Parsing a large 80GB .gz file and making an output file with specific columns from this original file.
by Anonymous Monk on Jul 16, 2013 at 11:13 UTC
Re: Parsing a large 80GB .gz file and making an output file with specific columns from this original file.
by rjt (Curate) on Jul 16, 2013 at 11:28 UTC

    Where is the bottleneck in your program? The parsing code you've posted could probably be optimized, but your problem may well be IO/compression related (and that, too, is just a possibility).

    You need to profile your code to find out which subs/blocks are consuming the most time. My preference is Devel::NYTProf, which can produce very detailed HTML reports. From its SYNOPSIS:

    # profile code and write database to ./nytprof.out
    perl -d:NYTProf some_perl.pl

    # convert database into a set of html files, e.g., ./nytprof/index.html
    # and open a web browser on the nytprof/index.html file
    nytprofhtml --open
