
by tfrayner (Curate)
on Jul 31, 2001 at 19:41 UTC ( #101196 )

Category: Utility scripts
Author/Contact Info Tim Rayner
Description: Time once again to reinvent the wheel, I suspect. I wrote this a couple of years back as an exercise in Perl. Specifically, a friend of mine wanted to manipulate large (>10GB) tables of data. Part of his analysis involved transposing a table so that the columns became lines and vice versa, so I wrote this rather convoluted script. It only loads one line of the table into memory at a time, rather than the whole table.

The other points of this exercise were to make the code as well-structured (whatever that means :-P) and as user-friendly as possible. I imagine it's possible to load the essential functionality into a single line, but that wasn't my aim here.

Of course, I imagine there's a perfectly good CPAN module out there which would do this much better than this :-)
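For contrast, here is what the obvious in-memory approach looks like (a sketch with made-up data; it is fine for small tables, but it is exactly what the script below avoids, since a >10GB table will not fit in memory):

```perl
use strict;
use warnings;

# Naive in-memory transpose: build the whole output table at once.
my @table = ( [ 'a', 'b', 'c' ],
              [ '1', '2', '3' ] );
my @transposed;
for my $row (0 .. $#table) {
    for my $col (0 .. $#{ $table[$row] }) {
        $transposed[$col][$row] = $table[$row][$col];
    }
}
print join("\t", @$_), "\n" for @transposed;
# prints:
# a       1
# b       2
# c       3
```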

Update: The original script opened and closed the input file with each column that it read. I've changed to the more efficient seek function suggested by RhetTbull.
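The change amounts to rewinding a single open filehandle rather than reopening the file for each column. In outline (a self-contained sketch, using this script's own source, $0, as a stand-in input file):

```perl
use strict;
use warnings;

# seek($fh, 0, 0) moves the read position back to byte 0, so the
# same handle can be read from the top again without a fresh open().
open(my $fh, '<', $0) or die "Cannot open $0: $!\n";
my $first_pass  = <$fh>;              # read the first line...
seek($fh, 0, 0) or die "seek: $!\n";  # ...rewind to the start...
my $second_pass = <$fh>;              # ...and read it again
die "seek went wrong" unless $first_pass eq $second_pass;
```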

#! /usr/bin/perl -w

# To Do: more error checking, get --delimiter to work for \t. 

# takes an input file with fields separated 
# by the chosen delimiter (default is the tab stop) and 
# switches the columns into rows (and vice versa, 
# obviously). The script can be called with the input file, 
# or both input and output files as its arguments. If one or 
# both are omitted from the command line the user is prompted 
# for their location. The delimiter can be defined using the 
# -d or --delimiter switch. The delimiter may be double-
# quoted if necessary. In non-interactive mode the output 
# file will not overwrite an existing file unless the -f or 
# --force switch is used. Use the -h or --help
# switch to get a usage summary.

use strict;
use Getopt::Long;

my $delimiter = "\t"; # default delimiter is tab stop 
my $infile = "";
my $outfile = "";

sub parseargs{
    my $force;
    my $helptext;
    &GetOptions("d|delimiter=s" => \$delimiter, 
                "f|force"       => \$force, 
                "h|help"        => \$helptext);
    if ($helptext){
        die ("Usage: cols2lines [-h] [-f] [-d <delimiter>]".
                        " [input file] [output file]\n");
    }
    if (@ARGV){
        $infile=shift (@ARGV);
        unless (-e $infile){die ("File \'$infile\' not found.\n");}
        $outfile=shift (@ARGV);
        if ((-e $outfile) and !$force){
            die ("File \'$outfile\' already exists. Use -f to".
                        " force overwrite.\a\n");
        }
    }
    return ($infile, $outfile);
}

sub queryinfile {
    until (-e $infile){
        print ("Please enter the name of the input file:\n");
        $infile = <STDIN>;
        chomp ($infile);
        unless (-e $infile){
            print STDERR ("File \'$infile\' not found. ".
                                        "Ctrl-C to exit.\a\n");
        }
    }
    return ($infile);
}

sub queryoutfile {
    print ("Please enter the name of the output file:\n");
    $outfile = <STDIN>;
    chomp ($outfile);
    if (-e $outfile){
        print STDERR ("File \'$outfile\' already exists. ".
                                        "Overwrite? [Y\/N]\a\n");
        my $answer = <STDIN>;
        chomp ($answer);
        if (lc($answer) ne 'y'){die ("Script aborted by user.\n");}
    }
    return ($outfile);
}

sub bigfile_colstolines {
    my $infile = shift;
    my $outfile = shift;
    open (INFILE, "<$infile") or die ("File error.\a\n");      # read-only
    open (OUTFILE, ">$outfile") or die ("Output failure.\a\n"); # write-only
    my $line = <INFILE>;
    my @testarray = split (/$delimiter/, $line);  # counts the columns
    for (my $counter=0; $counter <= $#testarray; $counter++){
        seek (INFILE, 0, 0);     # rewind rather than reopening the file
        my @columnarray = ();
        while (defined ($line = <INFILE>)){
            chomp ($line);
            my @linearray = split (/$delimiter/, $line);
            push (@columnarray, $linearray [$counter]);
        }
        my $newline = join ($delimiter, @columnarray);
        print OUTFILE ($newline, "\n");
    }
    close (INFILE);
    close (OUTFILE);
}

($infile, $outfile)=&parseargs;
unless ($infile){$infile=&queryinfile;}
unless ($outfile){$outfile=&queryoutfile;}
&bigfile_colstolines ($infile, $outfile);
print ("Done.\n");

by RhetTbull (Curate) on Jul 31, 2001 at 21:18 UTC
    Nice idea -- it's something I've had to do before. I would recommend two changes.

    1. Instead of specifying the files on the command line, use the Unix "filter" paradigm where you read input (either from a named file or from STDIN) and write it out to STDOUT. That way the user could do something like (depending on their shell):

        ./cols2lines bigfile > bigfile2

    or something like:

        for file in *.mat; do echo $file; ./cols2lines $file > $file.2; done

    in order to process a bunch of files.
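    The filter idea above can be sketched like this (a hypothetical `filter` routine, not part of the original script; in practice Perl's magic <> operator already reads the files named in @ARGV, or STDIN when there are none):

```perl
use strict;
use warnings;

# Unix "filter" sketch: read lines from whichever handle you are
# handed (an opened file, or STDIN) and write results to STDOUT.
# In a real script, the diamond operator does the dispatching:
#     while (<>) { ... }   # files from @ARGV, else STDIN
sub filter {
    my ($in) = @_;
    while (my $line = <$in>) {
        chomp $line;
        print "$line\n";   # a real filter would transform $line here
    }
}
```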

    2. Don't open and close the file so many times! Use seek instead. It's probably faster. For a file with many columns, you will open and close the file a lot -- and that takes up time. I did a quick benchmark; here are the results on my system from reading a large file hundreds of times:

    Benchmark: timing 100 iterations of openclose, seek...
    openclose: 186 wallclock secs (161.45 usr + 19.77 sys = 181.22 CPU) @ 0.55/s (n=100)
         seek:  17 wallclock secs ( 16.08 usr +  1.02 sys =  17.10 CPU) @ 5.85/s (n=100)
    Because your program is doing a lot of I/O and other things (like pushing stuff onto big arrays) not all your time is spent opening and closing files so the speedup won't be as dramatic as the simple benchmark but it will be faster. I've made a small change (changed 3 lines) to your program to use seek instead of repeated open/close. Using the modified code on a file with 1000 columns, it ran about 25% faster than yours (a significant improvement if the file is really big).
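    A comparison along these lines can be run with the core Benchmark module. This is a sketch, not the benchmark above: the file name and contents are made up, and the script generates its own small input so it is self-contained.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Generate a throwaway tab-delimited input file (hypothetical data).
my $file = 'bench_input.txt';
open(my $w, '>', $file) or die "open: $!";
print $w join("\t", 1 .. 10), "\n" for 1 .. 1000;
close $w;

# One handle opened once and rewound, versus a fresh open each pass.
open(my $keep, '<', $file) or die "open: $!";
timethese(100, {
    openclose => sub {
        open(my $fh, '<', $file) or die "open: $!";
        1 while <$fh>;              # read every line, discard
        close $fh;
    },
    seek => sub {
        seek($keep, 0, 0) or die "seek: $!";
        1 while <$keep>;            # same read, no open/close
    },
});
close $keep;
unlink $file;
```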
      Thanks for the tips. I must admit, I hadn't appreciated the overhead involved with the repeated open/close operations. Turns out seek is my new best friend :-)

      The reason the STDIN/STDOUT unix filter paradigm wasn't implemented in this script has more to do with its original development than with its final functionality. The first script was written on a Mac and I haven't found a good way to deal with STDOUT using MacPerl. Although it did make for some easily-implemented dialogue boxes :-)
