cols2lines.pl

by tfrayner (Curate)
on Jul 31, 2001 at 19:41 UTC
Category: Utility scripts
Author/Contact Info Tim Rayner
tfrayner@yahoo.co.uk
Description: Time once again to reinvent the wheel, I suspect. I wrote this a couple of years back as an exercise in Perl. Specifically, a friend of mine wanted to manipulate large (>10GB) tables of data. Part of his analysis involved transposing a table so that the columns became lines and vice versa. As a result I wrote this rather convoluted script, which loads only one line of the table into memory at a time rather than the whole table.

The other aim of this exercise was to make the code as well-structured (whatever that means :-P) and as user-friendly as possible. I imagine it's possible to squeeze the essential functionality into a single line, but that wasn't my goal here.

Of course, I imagine there's a perfectly good CPAN module out there which would do this much better :-)
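For tables small enough to fit in memory, the transpose itself takes only a few lines. As a point of contrast with the streaming approach below, here is a minimal in-memory sketch (a hypothetical `transpose` helper, not part of the script):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# In-memory transpose of a tab-delimited table. Unlike cols2lines.pl,
# this holds the whole table in memory, so it only suits small files.
sub transpose {
    my @rows = @_;
    my @out;
    for my $r (0 .. $#rows) {
        my @fields = split /\t/, $rows[$r];
        for my $c (0 .. $#fields) {
            $out[$c][$r] = $fields[$c];    # row r, column c -> row c, column r
        }
    }
    return map { join("\t", @$_) } @out;
}

# "a b c" over "1 2 3" becomes "a 1", "b 2", "c 3"
print "$_\n" for transpose("a\tb\tc", "1\t2\t3");
```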

Update: The original script opened and closed the input file with each column that it read. I've changed to the more efficient seek function suggested by RhetTbull.

#! /usr/bin/perl -w

# To Do: more error checking, get --delimiter to work for \t. 

# cols2lines.pl takes an input file with fields separated 
# by the chosen delimiter (default is the tab stop) and 
# switches the columns into rows (and vice versa, 
# obviously). The script can be called with the input file, 
# or both input and output files as its arguments. If one or 
# both are omitted from the command line the user is prompted 
# for their location. The delimiter can be defined using the 
# -d or --delimiter switch. The delimiter may be double-
# quoted if necessary. In non-interactive mode the output 
# file will not overwrite an existing file unless the -f or 
# --force switch is used. Use the -h or --help
# switch to get a usage summary.

use strict;
use Getopt::Long;

# DECLARE GLOBALS
my $delimiter = "\t"; # default delimiter is tab stop 
my $infile = "";
my $outfile = "";

# PARSE ARGUMENTS
sub parseargs{
    my $force;
    my $helptext;
    &GetOptions("d|delimiter=s" => \$delimiter, 
                "f|force"       => \$force, 
                "h|help"        => \$helptext);
    if ($helptext){
        die ("Usage: cols2lines [-h] [-f] [-d <delimiter>]".
                        " [input file] [output file]\n");
    }
    if (@ARGV){
        $infile=shift (@ARGV);
        unless (-e $infile){die ("File \'$infile\' not found.\n");}
        $outfile=shift (@ARGV) || "";
        if ((-e $outfile) and !$force){
            die ("File \'$outfile\' already exists. Use -f to".
                        " force overwrite.\a\n");
        }
    } else {
        $infile="";
        $outfile="";
    }
    return ($infile, $outfile);
}

# ASSIGN INPUT FILE
sub queryinfile {
    $infile="";
    until (-e $infile){
        print ("Please enter the name of the input file:\n");
        $infile = <STDIN>;
        chomp ($infile);
        unless (-e $infile){
            print STDERR ("File \'$infile\' not found. ".
                                        "Ctrl-C to exit.\a\n");
        }
    }
    $infile;
}

# ASSIGN OUTPUT FILE
sub queryoutfile {
    print ("Please enter the name of the output file:\n");
    $outfile = <STDIN>;
    chomp ($outfile);
    if (-e $outfile){
        print STDERR ("File \'$outfile\' already exists. ".
                                        "Overwrite? [Y\/N]\a\n");
        my $answer = <STDIN>;
        chomp ($answer);
        if (lc($answer) ne 'y'){die ("Script aborted by user.\n");}
    }
    $outfile;
}

# PERFORM OPERATION
sub bigfile_colstolines {
    my $infile = shift;
    my $outfile = shift;
    open (INFILE, "<$infile")               # read-only
        or die ("Can't open \'$infile\': $!\a\n");
    open (OUTFILE, ">$outfile")             # write-only
        or die ("Can't open \'$outfile\': $!\a\n");
    my $line = <INFILE>;
    my @testarray = split (/$delimiter/, $line);
    seek(INFILE,0,0);
    for (my $counter=0; $counter <= $#testarray; $counter++){
        my @columnarray = ();
        while (defined ($line = <INFILE>)){
            chomp ($line);
            my @linearray = split (/$delimiter/, $line);
            push (@columnarray, $linearray [$counter]);
        }
        my $newline = join $delimiter, (@columnarray);
        print OUTFILE ($newline, "\n");
        seek(INFILE,0,0);
    }
    close(INFILE);
    close(OUTFILE);
}

# MAIN LOOP

($infile, $outfile)=&parseargs;
unless ($infile){$infile=&queryinfile;}
unless ($outfile){$outfile=&queryoutfile;}
&bigfile_colstolines ($infile, $outfile);
print ("Done.\n");
exit;
Re: cols2lines.pl
by RhetTbull (Curate) on Jul 31, 2001 at 21:18 UTC
    Nice idea -- it's something I've had to do before. I would recommend two changes.

    1. Instead of specifying the files on the command line, use the Unix "filter" paradigm where you read in a file (either from a named file or STDIN) and write it out to STDOUT. That way the user could do something like (depending on their shell):

        cols2lines.pl bigfile > bigfile2

    or something like:

        for file in *.mat; do echo $file; ./cols2lines.pl $file > $file.2; done

    in order to process a bunch of files.
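    The filter shape itself is easy to sketch in Perl with the diamond operator, which reads from the files named in @ARGV or from STDIN when there are none. A minimal in-memory version (the `transpose_rows` helper is hypothetical, for illustration only):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Filter-style transpose: read the files named in @ARGV (or STDIN if
# none), write the transposed table to STDOUT. In-memory for brevity;
# a streaming version would rewind and re-read as cols2lines.pl does.
sub transpose_rows {
    my @table = map { [ split /\t/, $_ ] } @_;
    my $cols  = @table ? scalar @{ $table[0] } : 0;
    return map {
        my $c = $_;
        join "\t", map { $_->[$c] } @table;
    } 0 .. $cols - 1;
}

my @lines;
while (my $line = <>) {    # <> is @ARGV files, falling back to STDIN
    chomp $line;
    push @lines, $line;
}
print "$_\n" for transpose_rows(@lines);
```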

    2. Don't open and close the file so many times! Use seek instead. It's probably faster. For a file with many cols, you will open and close the file a lot -- that takes up time. I did a quick benchmark and on my system here are the results from reading a large file hundreds of times:

    Benchmark: timing 100 iterations of openclose, seek...
    openclose: 186 wallclock secs (161.45 usr + 19.77 sys = 181.22 CPU) @ 0.55/s (n=100)
          seek:  17 wallclock secs (16.08 usr +  1.02 sys =  17.10 CPU) @ 5.85/s (n=100)
    Because your program is doing a lot of I/O and other things (like pushing stuff onto big arrays) not all your time is spent opening and closing files so the speedup won't be as dramatic as the simple benchmark but it will be faster. I've made a small change (changed 3 lines) to your program to use seek instead of repeated open/close. Using the modified code on a file with 1000 columns, it ran about 25% faster than yours (a significant improvement if the file is really big).
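    The rewind itself is just a seek back to byte zero of the open handle. A minimal sketch of the idea (the scratch file here is created purely for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Demonstrate rewinding a filehandle with seek() instead of
# close()ing and re-open()ing it for every pass over the file.
my ($tmp, $name) = tempfile(UNLINK => 1);
print $tmp "first\nsecond\n";
close $tmp;

open my $in, '<', $name or die "Can't open $name: $!";
my $pass1 = <$in>;          # first line of pass one
1 while <$in>;              # drain the handle to EOF
seek($in, 0, 0);            # rewind: start pass two without re-opening
my $pass2 = <$in>;          # first line again
close $in;

print "rewind worked\n" if $pass1 eq $pass2;
```

    A comparison like the one above could be timed with the core Benchmark module (e.g. its timethese() function).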
      Thanks for the tips. I must admit, I hadn't appreciated the overhead involved with the repeated open/close operations. Turns out seek is my new best friend :-)

      The reason the STDIN/STDOUT unix filter paradigm wasn't implemented in this script has more to do with its original development than with its final functionality. The first script was written on a Mac and I haven't found a good way to deal with STDOUT using MacPerl. Although it did make for some easily-implemented dialogue boxes :-)
