in reply to multicolumn extraction

Extract the 2nd and 3rd columns, and write them to a new file:

use strict;
use warnings;

open my $APRI2_in, '<', 'path_to_infile.txt'  or die $!;
open my $out_fh,   '>', 'path_to_outfile.txt' or die $!;

while ( <$APRI2_in> ) {
    chomp;    # drop the newline so a 3-column line does not print a blank line
    my @columns = split /\t/, $_;
    print $out_fh "$columns[1]\t$columns[2]\n";
}

close $APRI2_in;
close $out_fh or die $!;

...or as a one-liner...

perl -plaF/\t/ -e '$_ = "$F[1]\t$F[2]"' infile > outfile


Re^2: multicolumn extraction
by NetWallah (Canon) on Jun 03, 2012 at 18:06 UTC
    For the one-liner, the pattern for the -F flag needs to be escaped, since the shell will gobble the first "\":
    perl -apF\\t -e '$_="$F[1]\t$F[2]\n"'

                 I hope life isn't a big joke, because I don't get it.

Re^2: multicolumn extraction
by Kenosis (Priest) on Jun 03, 2012 at 16:30 UTC

    Nicely done, Dave. However, since we don't know the number of columns or the file size, do you think it would be better to limit the split to only the needed columns, as in the following?

    my @columns = (split /\t/)[1 .. 2];

    Depending on these factors and the machine, the script might otherwise choke.

    Just a thought...


    Now splitting on /\t/ based upon sauoq's good catch in his comment below.

      Depending on these factors and the machine, the script might otherwise choke.

      That's highly unlikely as the file is being handled line by line. And if there were a truly humongous line, your modification actually wouldn't be much better.

      And you've introduced a potential bug by splitting a tab-delimited file on whitespace instead of on tabs.

      "My two cents aren't worth a dime.";

        Your reply makes sense, sauoq. I see that I assumed too much in reading the OP's sample fields as never containing spaces, so it would be best to split on the known field delimiter. Indeed, it would be disastrous if the first field contained spaces. Good catch, and thank you for bringing this to my attention.
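
To make the point about delimiters concrete, here is a minimal sketch. The sample record and the helper name second_and_third are made up for illustration, and /\s+/ is only an assumed stand-in for "splitting on whitespace":

```perl
use strict;
use warnings;

# Hypothetical record: three tab-separated fields, the first containing a space.
my $line = "John Smith\t42\tLondon\n";

# Return the 2nd and 3rd fields of a record, splitting on the given pattern.
sub second_and_third {
    my ( $pattern, $record ) = @_;
    chomp $record;
    return ( split $pattern, $record )[ 1, 2 ];
}

my @by_tab   = second_and_third( qr/\t/,  $line );
my @by_space = second_and_third( qr/\s+/, $line );

print "tab:        @by_tab\n";      # prints "tab:        42 London"
print "whitespace: @by_space\n";    # prints "whitespace: Smith 42"
```

Splitting on whitespace shifts every later field by one as soon as a field contains a space, which is why the tab split is the safer choice for tab-delimited data.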


        I was curious to see whether there was any speed difference between splitting all columns and splitting only some, so I ran the following, which creates and splits a 20 column x 10000 row file:

        use Modern::Perl;
        use Benchmark qw(cmpthese);

        my $entry       = "aaaaaaaaaaaaaaaaa";
        my $columnsFile = 'columns.txt';

        open my $file, ">$columnsFile" or die $!;
        do { print $file "$entry\t" x 19; say $file $entry } for 1 .. 10000;
        close $file;

        sub splitAll {
            open my $file, "<$columnsFile" or die $!;
            while (<$file>) { my @columns = split /\t/; }
            close $file;
        }

        sub splitSome {
            open my $file, "<$columnsFile" or die $!;
            while (<$file>) { my @columns = ( split /\t/ )[ 1 .. 2 ]; }
            close $file;
        }

        cmpthese( -5, { splitAll => sub { splitAll() }, splitSome => sub { splitSome() } } );


                    Rate  splitAll splitSome
        splitAll  19.8/s        --      -21%
        splitSome 25.1/s       27%        --

        In this case, splitting only some columns shows a significant speed advantage, even with this relatively small file. I ran the script many times, getting as high as 31% for splitSome and as low as 21%, but always showing that splitSome is significantly faster.
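
A further variant that might be worth adding to the comparison: split accepts a LIMIT as its third argument, so it can stop scanning once the needed fields have been found instead of splitting the whole line and discarding most of it. The sketch below is only an illustration; the helper name second_and_third_limited and the sample line are made up, and it has not been timed against the numbers above.

```perl
use strict;
use warnings;

# Return the 2nd and 3rd tab-separated fields of one line.
# A LIMIT of 4 yields at most fields 0, 1 and 2 plus one unsplit remainder,
# so split stops looking for delimiters after the third tab.
sub second_and_third_limited {
    my ($record) = @_;
    chomp $record;    # keeps field 2 clean when the line has only 3 columns
    return ( split /\t/, $record, 4 )[ 1, 2 ];
}

# Hypothetical sample line with six columns: a through f.
my @fields = second_and_third_limited( join( "\t", 'a' .. 'f' ) . "\n" );
print "@fields\n";    # prints "b c"
```

Note that the limit has to be one more than the highest index wanted plus one (4 here, not 3), otherwise the third field would still carry the rest of the line.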