http://www.perlmonks.org?node_id=101236


in reply to cols2lines.pl

Nice idea -- it's something I've had to do before. I would recommend two changes.

1. Instead of specifying the files on the command line, use the Unix "filter" paradigm: read the input from a named file or from STDIN, and write the result to STDOUT. That way the user could do something like this (depending on their shell):

    cols2lines.pl bigfile > bigfile2

or, in order to process a bunch of files:

    for file in *.mat; do echo $file; ./cols2lines.pl $file > $file.2; done
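The filter shape from point 1 might look roughly like this in Perl. It's only a sketch: the helper names `open_input` and `transpose_filter` are made up for illustration, the delimiter is passed in as a literal string rather than taken from your program's `$delimiter`, and it transposes in memory just to keep the example short.

```perl
use strict;
use warnings;

# Hypothetical helper: return a read handle for the named file,
# or STDIN when no file name (or "-") is given.
sub open_input {
    my ($path) = @_;
    return \*STDIN unless defined $path && $path ne '-';
    open(my $fh, '<', $path) or die "Cannot open $path: $!\n";
    return $fh;
}

# Hypothetical helper: read delimited rows from $in and write each
# column as one line to $out. Holds the whole table in memory.
sub transpose_filter {
    my ($in, $out, $delimiter) = @_;
    my @columns;
    while (defined(my $line = <$in>)) {
        chomp $line;
        # -1 keeps trailing empty fields; \Q...\E treats the delimiter literally
        my @fields = split /\Q$delimiter\E/, $line, -1;
        push @{ $columns[$_] }, $fields[$_] for 0 .. $#fields;
    }
    print {$out} join($delimiter, @$_), "\n" for @columns;
}

# A script would finish with something like:
#   transpose_filter(open_input($ARGV[0]), \*STDOUT, "\t");
```

With that last line uncommented, the script could be used as `./cols2lines.pl bigfile > bigfile2` or at the end of a pipe. One caveat: holding everything in memory is exactly what a big-file routine tries to avoid, and a multi-pass approach that rewinds with seek needs a real file, since seek fails on a pipe.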

2. Don't open and close the file so many times! Use seek instead; it's probably faster. For a file with many columns, you will open and close the file a lot, and that takes time. I did a quick benchmark, and on my system here are the results from reading a large file hundreds of times:

    Benchmark: timing 100 iterations of openclose, seek...
    openclose: 186 wallclock secs (161.45 usr + 19.77 sys = 181.22 CPU) @ 0.55/s (n=100)
         seek: 17 wallclock secs (16.08 usr + 1.02 sys = 17.10 CPU) @ 5.85/s (n=100)

Because your program is doing a lot of I/O and other things (like pushing stuff onto big arrays), not all of its time is spent opening and closing files, so the speedup won't be as dramatic as in the simple benchmark. It will still be faster, though. I've made a small change (3 lines) to your program to use seek instead of repeated open/close. On a file with 1000 columns, the modified code ran about 25% faster than yours (a significant improvement if the file is really big).
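If anyone wants to repeat that kind of comparison, a benchmark along these lines should do. This is a minimal sketch and not the script that produced the numbers above: the sub names, the test file size, and the iteration counts are all arbitrary choices for illustration. It uses `timethese` from the standard Benchmark module.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);
use File::Temp qw(tempfile);

# Read every line of $path $passes times, reopening the file for each pass.
sub read_with_reopen {
    my ($path, $passes) = @_;
    my $lines = 0;
    for my $pass (1 .. $passes) {
        open(my $fh, '<', $path) or die "Cannot open $path: $!\n";
        while (defined(my $line = <$fh>)) { $lines++ }
        close $fh;
    }
    return $lines;
}

# Same work, but open once and rewind with seek before each pass.
sub read_with_seek {
    my ($path, $passes) = @_;
    my $lines = 0;
    open(my $fh, '<', $path) or die "Cannot open $path: $!\n";
    for my $pass (1 .. $passes) {
        seek($fh, 0, 0) or die "Cannot rewind $path: $!\n";
        while (defined(my $line = <$fh>)) { $lines++ }
    }
    close $fh;
    return $lines;
}

# Build a throwaway input file (2000 rows of 20 tab-separated fields).
my ($tmp, $bench_path) = tempfile(UNLINK => 1);
print {$tmp} join("\t", 1 .. 20), "\n" for 1 .. 2000;
close $tmp;

# Time both strategies; the labels match the ones in the output above.
timethese(50, {
    openclose => sub { read_with_reopen($bench_path, 5) },
    seek      => sub { read_with_seek($bench_path, 5) },
});
```

The numbers will depend heavily on the file size, the filesystem, and how much of the file the OS has cached, so treat any single run as a rough indication only.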
Here's your sub bigfile_colstolines modified to use seek:

    sub bigfile_colstolines {
        my $infile = shift;
        my $outfile = shift;
        my $infilehandle = "<$infile"; # read-only
        open (INFILE, $infilehandle) or die ("File error.\a\n");
        my $outfilehandle = ">$outfile"; # write only
        open (OUTFILE, $outfilehandle) or die ("Output failure.\a\n");
        my $line = <INFILE>;
        my @testarray = split (/$delimiter/, $line);
        close (INFILE);
        open (INFILE, $infilehandle) or die ("File error.\a\n");
        for (my $counter=0; $counter <= $#testarray; $counter++){
            my @columnarray = undef();
            while (defined ($line = <INFILE>)){
                chomp ($line);
                my @linearray = split (/$delimiter/, $line);
                push (@columnarray, $linearray [$counter]);
            }
            shift (@columnarray); # removes unwanted characters
            my $newline = join $delimiter, (@columnarray);
            print OUTFILE ($newline, "\n");
            #rewind the file
            seek(INFILE,0,0);
        }
        close(INFILE);
        close(OUTFILE);
    }
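
For comparison, the same one-pass-per-column idea could also be written with lexical filehandles and three-argument open. This is a sketch of a variant, not a drop-in replacement: the name `colstolines_seek` is made up, the delimiter is passed as an argument and treated as a literal string (your version reads a global `$delimiter` and uses it as a regex), and missing fields in short rows are written as empty strings.

```perl
use strict;
use warnings;

# Hypothetical variant: write each input column as one output line,
# making one pass over the input per column and rewinding with seek.
sub colstolines_seek {
    my ($infile, $outfile, $delimiter) = @_;
    open(my $in,  '<', $infile)  or die "Cannot read $infile: $!\n";
    open(my $out, '>', $outfile) or die "Cannot write $outfile: $!\n";

    # Count the columns from the first line.
    my $first = <$in>;
    return unless defined $first;    # empty input: nothing to write
    chomp $first;
    my @probe = split /\Q$delimiter\E/, $first, -1;

    for my $col (0 .. $#probe) {
        # Rewind instead of closing and reopening the file.
        seek($in, 0, 0) or die "Cannot rewind $infile: $!\n";
        my @column;
        while (defined(my $line = <$in>)) {
            chomp $line;
            my @fields = split /\Q$delimiter\E/, $line, -1;
            push @column, defined $fields[$col] ? $fields[$col] : '';
        }
        print {$out} join($delimiter, @column), "\n";
    }
    close $in;
    close $out or die "Cannot finish writing $outfile: $!\n";
}
```

Starting each pass with the seek also covers the rewind after the first line is read, so the extra close/open before the loop is no longer needed.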