PerlMonks  

Merging Many Files Side by Side

by sesemin (Beadle)
on Feb 19, 2009 at 19:21 UTC ( [id://745168]=perlquestion )

sesemin has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have written the following script to read the second column of many tab-delimited files (~200) and merge them side by side in an output file. Each input file has three tab-delimited columns and 6.9 million lines. Reading and writing 6.9 million lines this way will take forever. Do you have a better solution that is quicker and more efficient?

#!/usr/bin/perl -w
my(@handles);
unlink "Results.txt";    # would loop if already present
for (<*.TI>) {
    open($handles[@handles], $_);
}
open(OUTFILE, ">Results.txt");
my $atleastone = 1;
while ($atleastone) {
    $atleastone = 0;
    for my $op (@handles) {
        if ($_ = readline($op)) {
            my @col = split;
            $col[1] += 0;    # otherwise you print nothing but a \t if column 2 is undef
            print OUTFILE "$col[1]\t";
            $atleastone = 1;
        } else {
            print OUTFILE "0\t";
        }
    }
    print OUTFILE "\n";
}
undef @handles;    # closes all files
close(OUTFILE);

Re: Merging Many Files Side by Side
by ELISHEVA (Prior) on Feb 19, 2009 at 19:34 UTC

    You could save some memory (and perhaps reduce slowdown due to memory paging) if you opened the files one by one. By opening all the files at once, you are also allocating 200 read buffers, one per file, all at once.

    1. Create an array of arrays.
    2. Read the first file one line at a time, and place all of column 2's values in $aa[0];
    3. Read the second file one line at a time, and place all of column 2's values in $aa[1];
    4. And so on...
    5. Loop through array and print to file
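
    The steps above could be sketched roughly as follows. The helper names read_second_column and write_side_by_side are made up for illustration, and the sketch assumes that short files should be padded with 0, as in the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: return column 2 of one tab-delimited file as an array ref.
sub read_second_column {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my @values;
    while (my $line = <$fh>) {
        chomp $line;
        my $value = (split /\t/, $line)[1];
        # Assumption: a missing or empty column 2 is written as 0.
        push @values, (defined $value && length $value) ? $value : 0;
    }
    close($fh);
    return \@values;
}

# Hypothetical helper: print the collected columns side by side.
sub write_side_by_side {
    my ($out_path, @aa) = @_;
    my $rows = 0;
    for my $col (@aa) {
        $rows = @$col if @$col > $rows;    # longest file decides the row count
    }
    open(my $out, '>', $out_path) or die "Cannot write $out_path: $!";
    for my $i (0 .. $rows - 1) {
        # Files that ran out of lines contribute 0 for this row.
        print {$out} join("\t", map { defined $_->[$i] ? $_->[$i] : 0 } @aa), "\n";
    }
    close($out) or die "Cannot close $out_path: $!";
    return $rows;
}
```

    It might be called as write_side_by_side('Results.txt', map { read_second_column($_) } glob('*.TI')). With 200 files of 6.9 million lines each, that is about 1.38 billion values held in memory at once, so in practice the files would probably have to be processed in smaller batches.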

    With files that large, you might also want to consider tying the arrays holding the columns to random access files. I once had a tenfold speedup (in a C++ program) just by using temp files instead of RAM to store data while I was processing. See Tie::File.

    If your columns are fixed width, you might be able to avoid two cycles (one to read the array and one to print the array) by keeping a variable that stores the current line length. Then instead of an array of arrays, you could just:

    1. Read the first file
      1. Write the first file by placing each col 2 value on a separate line.
      2. Increase the line length by one column width
    2. Read the next file
      1. Seek to the end of the line, insert the column
      2. Seek to end of the column you just wrote
      3. Seek to end of next line, insert the column
      4. After all lines have been read, increase line length by one column width
    3. Repeat until all files are processed
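
    A rough sketch of the fixed-width idea follows. It is a variant rather than a literal transcription of the steps: bytes cannot be inserted into the middle of a plain file without rewriting everything after them, so this version lays down each row at its final width while reading the first file and then overwrites one slot per row for every later file. The helper name add_column and the parameters $ncols and $width are assumptions, and the output handle is assumed to be opened read/write (for example with '+>').

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical helper: copy column 2 of $in_path into slot $k of every row.
# Each row holds $ncols slots of $width characters, followed by a newline.
sub add_column {
    my ($out_fh, $in_path, $k, $ncols, $width) = @_;
    my $rowlen = $ncols * $width + 1;    # +1 for the newline
    open(my $in, '<', $in_path) or die "Cannot open $in_path: $!";
    my $row = 0;
    while (my $line = <$in>) {
        chomp $line;
        my $value = (split /\t/, $line)[1];
        $value = 0 unless defined $value && length $value;
        # Left-justify and pad to the slot width; longer values are truncated.
        # Assumption: $width is at least one more than the widest value,
        # so neighbouring cells stay separated by a space.
        my $cell = sprintf('%-*s', $width, substr($value, 0, $width));
        if ($k == 0) {
            # First file: write the whole row, blank-padded, sequentially.
            print {$out_fh} $cell, ' ' x ($rowlen - 1 - $width), "\n";
        } else {
            # Later files: jump to this row's slot and overwrite it.
            seek($out_fh, $row * $rowlen + $k * $width, SEEK_SET)
                or die "seek failed: $!";
            print {$out_fh} $cell;
        }
        $row++;
    }
    close($in);
    return $row;
}
```

    The sketch assumes the first file is at least as long as every other file; slots that a shorter file never reaches stay blank. One seek per line per file may well turn out slower than reading all files in parallel, so it would need to be timed before relying on it.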

    Best, beth

Re: Merging Many Files Side by Side
by atemon (Chaplain) on Feb 19, 2009 at 20:36 UTC

    Hi,

    Try the CPAN module Tie::File.
    Its documentation says:

    Tie::File represents a regular text file as a Perl array. Each element in the array corresponds to a record in the file. The first line of the file is element 0 of the array; the second line is element 1, and so on.

    The file is not loaded into memory, so this will work even for gigantic files.

    Changes to the array are reflected in the file immediately.

    The important fact is that the file is never loaded into memory. You can tie all your files, including the result file, and write to the result file without loading any of them into memory. Writing to the new file is then just a push onto the array tied to the result file.
    Hope this helps.
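
    A minimal sketch of that approach, assuming the same "column 2, pad with 0" rules as the original script. The helper name merge_with_tie_file is made up; tie, untie, the mode option and O_RDONLY are the documented Tie::File and Fcntl interfaces.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);
use Tie::File;

# Hypothetical helper: merge column 2 of each input file into $out_path.
sub merge_with_tie_file {
    my ($out_path, @in_paths) = @_;

    # Tie every input file read-only; each array element is one line.
    my (@inputs, @sizes);
    for my $path (@in_paths) {
        tie my @lines, 'Tie::File', $path, mode => O_RDONLY
            or die "Cannot tie $path: $!";
        push @inputs, \@lines;
        push @sizes, scalar @lines;
    }

    # Tie the result file and empty it in case it already exists.
    tie my @out, 'Tie::File', $out_path or die "Cannot tie $out_path: $!";
    @out = ();

    my $rows = 0;
    for my $n (@sizes) { $rows = $n if $n > $rows }

    for my $i (0 .. $rows - 1) {
        my @cells;
        for my $j (0 .. $#inputs) {
            my $line  = $i < $sizes[$j] ? $inputs[$j][$i] : undef;
            my $value = defined $line ? (split /\t/, $line)[1] : undef;
            # Assumption: missing lines or empty columns become 0.
            push @cells, (defined $value && length $value) ? $value : 0;
        }
        push @out, join("\t", @cells);    # appends one record to the result file
    }

    untie @out;
    untie @$_ for @inputs;
    return $rows;
}
```

    Tie::File keeps memory use low, but every element access goes through the tie interface, so for a purely sequential pass like this one it is likely to be slower than reading the files with plain filehandles.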

    Cheers !

    --VC

Re: Merging Many Files Side by Side
by repellent (Priest) on Feb 20, 2009 at 00:14 UTC
    This sounds like a job for cut and paste, if you're on a Unix platform.
      I used to use paste and cut: paste file1 file2 ...|cut -f2,3 > newfile.txt, and it actually works. The timing with that approach is not that bad, about 30 min or so for 200 files. However, I wanted to be more creative and use my Perl skills. Thank you very much anyway.
Re: Merging Many Files Side by Side
by gone2015 (Deacon) on Feb 20, 2009 at 00:37 UTC

    I assume you've tried it and discovered it's too slow ?

    200 files at 6.9M lines each, each line three fields tab delimited -- if that's 60 bytes per line, you're reading ~80G bytes and writing ~27G (assuming roughly equal field sizes). On my little Linux box, I just timed cp foo bar for an ~13G file and it took ~15mins -- so 80G + 27G looks like about two hours work ? Of course, with faster drives and what-not, you may do better.
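
    The back-of-the-envelope figures above can be reproduced with a few lines of Perl. The helper name estimate_io is made up, 1G is taken to mean 10^9 bytes, and the 60 bytes per line and the cp timing are the guesses from the paragraph above, not measurements of the real data.

```perl
use strict;
use warnings;

# Hypothetical helper: estimate I/O volume and elapsed time from rough inputs.
sub estimate_io {
    my (%arg) = @_;
    my $read_bytes  = $arg{files} * $arg{lines} * $arg{bytes_per_line};
    # Only one of three roughly equal fields is written back out.
    my $write_bytes = $read_bytes / 3;
    # Scale the reference copy time linearly with the total bytes moved.
    my $minutes = ($read_bytes + $write_bytes) / $arg{ref_bytes} * $arg{ref_minutes};
    return ($read_bytes / 1e9, $write_bytes / 1e9, $minutes);
}

my ($read_gb, $write_gb, $minutes) = estimate_io(
    files       => 200,
    lines       => 6.9e6,
    bytes_per_line => 60,
    ref_bytes   => 13e9,    # the ~13G file copied with cp
    ref_minutes => 15,      # time that copy took
);
printf "read %.1f GB, write %.1f GB, about %.0f minutes\n",
    $read_gb, $write_gb, $minutes;
```

    This prints "read 82.8 GB, write 27.6 GB, about 127 minutes", which matches the rounded figures of ~80G, ~27G and roughly two hours.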

    I've never tried opening 200 files at once...

    It's hard to say what the processing time will add to this. It looks pretty I/O bound. If processing time is an issue, then I'd look at reading the files a chunk at a time... but I'd have to be convinced there was a problem; and even then I'm not sure that processing chunks of the files in Perl would be quicker than using Perl to read a line at a time.

    So... what are you expecting, and what do you get when you try the straightforward approach ?

    Wishing to squeeze as much as possible out of the inner loop, I think the following may be faster:

    my $atleastone = 1;
    while ($atleastone) {
        $atleastone = 0;
        my @l = ();
        foreach my $op (@handles) {
            my $c2;
            ++$atleastone and (undef, $c2) = split if defined($_ = <$op>);
            push @l, $c2 || "0";
        }
        print OUTFILE join("\t", @l), "\n";
    }
    or possibly:
    my $atleastone = 1;
    while ($atleastone) {
        $atleastone = 0;
        my $l = '';
        foreach my $op (@handles) {
            my $c2;
            ++$atleastone and (undef, $c2) = split if defined($_ = <$op>);
            $l .= ($c2 || "0") . "\t";
        }
        chop $l;    # Discard trailing "\t"
        print OUTFILE "$l\n";
    }
    You seem concerned that column 2 might be empty, or that there may be blank or whitespace only lines in the input files. If you could guarantee that no line would be empty and no line would be whitespace only, then:
    my $atleastone = 1;
    while (defined($atleastone)) {
        $atleastone = undef;
        my $l = '';
        foreach my $op (@handles) {
            my $c2;
            ($atleastone, $c2) = split if defined($_ = <$op>);
            $l .= ($c2 || "0") . "\t";
        }
        chop $l;    # Discard trailing "\t"
        print OUTFILE "$l\n";
    }
    squeezes out a tiny fraction more! (It works by setting $atleastone to the first field, which is defined even if it is 0 or "", whenever <$op> returns a line.)
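
    For comparison, here is a more defensive sketch of the same line-at-a-time loop, wrapped in a made-up helper named merge_line_by_line. It uses lexical filehandles, checks every open (which matters when opening around 200 files at once), and stops before writing a row once every file is exhausted. Splitting on tabs and padding with 0 are assumptions carried over from the original script, and the sketch has not been timed against the versions above.

```perl
use strict;
use warnings;

# Hypothetical helper: merge column 2 of every input file, one output row per input line.
sub merge_line_by_line {
    my ($out_path, @in_paths) = @_;

    my @handles;
    for my $path (@in_paths) {
        open(my $fh, '<', $path) or die "Cannot open $path: $!";
        push @handles, $fh;
    }
    open(my $out, '>', $out_path) or die "Cannot write $out_path: $!";

    my $rows = 0;
    while (1) {
        my $got_line = 0;
        my @cells;
        for my $fh (@handles) {
            my $line = <$fh>;
            my $c2;
            if (defined $line) {
                $got_line = 1;
                chomp $line;
                $c2 = (split /\t/, $line)[1];
            }
            # Exhausted files and empty columns contribute 0.
            push @cells, (defined $c2 && length $c2) ? $c2 : 0;
        }
        last unless $got_line;    # all files exhausted: no trailing all-zero row
        print {$out} join("\t", @cells), "\n";
        $rows++;
    }

    close($out) or die "Cannot close $out_path: $!";
    close($_) for @handles;
    return $rows;
}
```

    It might be called as merge_line_by_line('Results.txt', glob('*.TI')).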
