PerlMonks  

xcol

by sflitman (Hermit)
on May 17, 2009 at 19:00 UTC (#764538=sourcecode)

Category: Utility Scripts
Author/Contact Info Stephen Flitman <sflitman at xenoscience.com>
Description: A simple text-based column extractor for use in Unix pipelines
#!/usr/bin/perl
# Stephen Flitman - extract one or more columns of a table
# Released under GPLv2
# Usage: ... | xcol N M ...
# where N, M, ... are 0-based column indices, and columns are split by tabs
# if invoked without arguments, tells you what columns are present and
# their indices, useful if there is a header row

use strict;

if (@ARGV) {
   while (<STDIN>) {
      chop;
      my @fields=split(/\t/,$_);
      for (my $i=0; $i<=$#ARGV; $i++) {
         print $fields[$ARGV[$i]];
         print "\t" if $i<$#ARGV;
      }
      print "\n";
   }
} else {
   my $line;
   until ($line=~/\t/) { $line=<STDIN>; }
   chop $line;
   die "No lines to process" unless $line;
   my $i=0;
   for my $field (split(/\t/,$line)) {
      printf "%3d: %s\n",$i++,$field;   # keep $field out of the format string in case it contains '%'
   }
}

exit;


Re: xcol
by graff (Chancellor) on May 18, 2009 at 05:45 UTC
    How is this better than unix "cut"? I know it's different: "cut" uses 1-based column numbers instead of 0-based, and column selection requires a "-f" option flag, and when "-f" is not provided, it exits with usage instructions, instead of listing the fields on the first tab-delimited line of input. But to make it better:
    • you might consider allowing options for selecting a delimiter other than tab (like "cut" does) -- e.g. the delimiter could be a regex (which is something "cut" can't do)
    • you might also consider allowing the output field delimiter to be different from the input field delimiter (something else "cut" can't do)

    I notice that you can output fields in arbitrary orders, and even output a given column more than once, and these are handy improvements over cut. But why stop there?
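To make the two delimiter suggestions concrete, here is a minimal sketch, not the author's code. The `-d` (input delimiter regex) and `-o` (output delimiter) flag names and the helpers `parse_args` and `extract_columns` are assumptions made up for illustration; `Getopt::Long` is the standard core module.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: split the command line into an input-delimiter
# regex, an output delimiter string, and the remaining column indices.
# Both delimiters default to a tab, matching the original xcol.
sub parse_args {
    my @args = @_;
    my ($in, $out) = ("\t", "\t");
    GetOptionsFromArray(\@args, 'd=s' => \$in, 'o=s' => \$out)
        or die "Usage: ... | xcol [-d REGEX] [-o STRING] N M ...\n";
    return (qr/$in/, $out, @args);
}

# Hypothetical helper: pick the requested columns from one line.
# Columns may appear in any order and may repeat; a missing column
# becomes an empty string.
sub extract_columns {
    my ($line, $in_re, $out, @cols) = @_;
    $line =~ tr/\r\n//d;                     # strip any line ending
    my @fields = split $in_re, $line, -1;    # -1 keeps trailing empty fields
    return join $out, map { $_ // '' } @fields[@cols];
}

# Demonstration on a fixed string instead of STDIN.
my ($re, $out, @cols) = parse_args('-d', '\s*,\s*', '-o', '|', 2, 0, 0);
print extract_columns("a, b,c\n", $re, $out, @cols), "\n";   # prints "c|a|a"
```

In a real script the loop would be `print extract_columns($_, $re, $out, @cols), "\n" while <STDIN>;` with `parse_args(@ARGV)`.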

    Minor nit-picks:

    • don't use "chop" ("chomp" is better, but see below)
    • given that it's only useful with piped input, die with usage instructions when STDIN is a tty (read about the "-t" function in "perldoc -f -X")
    • it might also be nice to die with instructions when @ARGV contains things that aren't digit strings
    • I like POD. Don't you?
    • given a set of digit strings in @ARGV, your process could be expressed in less code:
      while (<STDIN>) {
          tr/\r\n//d;  # even better than chomp!
          print join( "\t", ( split /\t/ )[@ARGV] ), "\n";
      }
    • (added as an update) BTW, this construct: until ($line=~/\t/) { $line=<STDIN>; } will be an infinite loop if the piped input never contains a tab character. I think you'll want this instead:
      while(<STDIN>) {
          last if /\t/;
      }
      die "No tab-delimited fields found\n" unless ( /\t/ );
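Pulling several of the nit-picks together, the following is a rough sketch under stated assumptions rather than a definitive rewrite. The helper names `validate_columns` and `list_columns` are hypothetical, and the demonstration reads from an in-memory string so it can run without piped input.

```perl
use strict;
use warnings;

# Hypothetical helper: die with usage text unless every argument
# is a plain digit string.
sub validate_columns {
    my @cols = @_;
    my @bad = grep { !/\A[0-9]+\z/ } @cols;
    die "Usage: ... | xcol N M ...  (not a column index: @bad)\n" if @bad;
    return @cols;
}

# Hypothetical helper: find the first tab-delimited line on $fh and
# return one "index: name" string per field. The loop stops at end of
# input, so input without any tab cannot cause an infinite loop.
sub list_columns {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        next unless $line =~ /\t/;
        $line =~ tr/\r\n//d;
        my $i = 0;
        return map { sprintf "%3d: %s", $i++, $_ } split /\t/, $line;
    }
    die "No tab-delimited fields found\n";
}

# In the real script one would also refuse to run interactively:
#   die "Usage: ... | xcol N M ...\n" if -t STDIN;

# Demonstration on an in-memory file; the first line has no tab and is skipped.
my $sample = "plain header\nname\tage\tcity\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print "$_\n" for list_columns($fh);
```

With real input, `list_columns(\*STDIN)` would replace the in-memory handle.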
