Extracting Columns from line

by snape (Pilgrim)
on Oct 23, 2013 at 01:50 UTC (#1059263=perlquestion)
snape has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks

I have two files. file1 (sample.txt) is a list of about 1000 sample IDs. These sample IDs are column names in file2 (sampleValue.txt), which is a data matrix of 30000*1500. I am interested in the values of all the rows in 1000 of the 1500 columns, for example columns 1,2,5,6,70,71,75,100,112,114 and so on. There is no pattern to the columns. Here is what I am doing, and I would like to know how I can improve it:

    use strict;
    use warnings;

    ## Variables
    my %sampleID;
    my %sampleValue;

    ## Opening first file
    open my $IN, "sample.txt" or die $!;
    my $header = <$IN>;
    while(<$IN>){
        chomp $_;
        my @line = split('\t', $_);
        $sampleID{$line[0]} = 1; ## Sample ID and Pam50 prediction
    }
    close($IN);

    print "Total number of sample ID: ", scalar(keys %sampleID),"\n"; ## 1000 columns

    ## Sample Value Data
    open $IN, "sampleValue.txt" or die $!; ## Columns are sample names from file1
    $header = <$IN>;
    my @samples = split("\t", $header);
    ## print "Total samples: ",scalar(@samples),"\n"; ## 1500

    ## loop for all the samples ids or the columns I am interested in
    for(my $i = 1; $i <= $#samples; $i++){ ## bcos the first instance is called header of the column 1
        my $sample = $samples[$i];
        $sampleValue{$sample} = $i if (exists $sampleID{$sample});
    }

    my $col = "";
    foreach my $key (keys %sampleValue){
        $col = $sampleValue{$key}.",".$col;
    }
    chop($col);
    print $col,"\n"; ## string of all the columns I am interested in

    ## The reason I do the above loop because I don't want to look for the interested
    ## cols thru the hash for every line of the file

    ## Reading the sample Value file
    while(<$IN>){
        chomp $_;
        print $_,"\n";
        my @line = split("\t", $_);
        @line = @line[ split /,/, $col]; ## previously it was @line = @line[$col] -- and i was getting error because $col is a string
        print @line,"\n";
    }

So, my question is: is there an easy way to convert the string $col into a list of numeric column indices, or is there a better way to get the desired columns?

UPDATE: I have updated my code. Maybe it will be of some use to people later.

Replies are listed 'Best First'.
Re: Extracting columns from line
by Athanasius (Chancellor) on Oct 23, 2013 at 02:54 UTC

    I don’t understand this part of the code:

        my $col = "";
        foreach my $key (keys %sampleValue){
            my @col1 = split("\t",$sampleValue{$key});
            $col = $col1[1].",".$col;
        }
        chop($col);
        print $col,"\n"; ## string of all the columns I am interested in

    So far as I can see, %sampleValue contains entries of the form Header1 => 1, Header2 => 2, Header5 => 5, ..., so splitting each value on tabs does nothing? In any case, you need to produce an array here, not a string:

        my @cols = sort { $a <=> $b } values %sampleValue;  # numeric sort; a plain sort would compare the indices as strings

    Then you can use an array slice:

        ## Reading the sample Value file
        while (<$IN>) {
            chomp;
            print $_, "\n";
            my @line = split "\t";
            @line = @line[@cols];   # <-- Use array slice here
            ...

    Some notes:

    • Always use strict; and use warnings;.
    • It would help if you provided sample data and expected output along with the code.
    • Update: Just checking: you do realise that in Perl (as in C), array indices start at 0, not 1?

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,
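
A minimal sketch of two details that matter when building the index list: array indices are zero-based, and Perl's default sort compares strings rather than numbers. The header line and column numbers here are made up for illustration and are not taken from the original files.

```perl
use strict;
use warnings;

# Hypothetical header line: an ID column followed by sample names.
my @header = split /\t/, "gene\tS1\tS2\tS3";

# Indices start at 0, so $header[0] is the ID column and S1 sits at index 1.
my %index_of = map { $header[$_] => $_ } 0 .. $#header;

# Made-up column numbers to show why the sort should be numeric.
my @cols       = ( 100, 2, 71, 1 );
my @as_strings = sort @cols;                  # string order:  1 100 2 71
my @as_numbers = sort { $a <=> $b } @cols;    # numeric order: 1 2 71 100

print "S1 is at index $index_of{S1}\n";
print "string sort:  @as_strings\n";
print "numeric sort: @as_numbers\n";
```

Either ordering selects the same set of columns in a slice; the numeric sort only keeps them in the same left-to-right order as the input file.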

      I always use strict and warnings. I just showed the relevant part of the code. Sorry, my bad. I have updated the code. Give me some time and I will provide the sample data and answer.

Re: Extracting Columns from line
by kcott (Chancellor) on Oct 23, 2013 at 04:41 UTC

    G'day snape,

    Given this extract from your posted code:

        print $col,"\n"; ## string of all the columns I am interested in
        ...
        @line = @line[$col]; ## I am getting the error since,
                             # $line is a string and not numeric..
                             # It works if you do
                             # @line[1,2,5,6,70,71,75,100,112,114]

    I suspect you haven't fully understood array slices. They're explained in perldata: Slices.

    This piece of code (which borrows from your variable names), should explain what I think you need:

        #!/usr/bin/env perl -l
        use strict;
        use warnings;

        my @line = qw{A B C D};
        print "All line elements: @line";

        my $col = '2,3';
        print "Indices I want: $col";

        my @indices = split /,/ => $col;
        my @wanted_elements = @line[@indices];
        print "Wanted elements: @wanted_elements";


        All line elements: A B C D
        Indices I want: 2,3
        Wanted elements: C D

    -- Ken

      Curse you, Ken! You posted (what could be a better solution) while I was working up my sample! :)


      The answer to the question "Can we do this?" is always an emphatic "Yes!" Just give me enough time and money.

      awesome !! you guys rock ..

Re: Extracting Columns from line
by boftx (Deacon) on Oct 23, 2013 at 04:50 UTC

    I'm pretty sure this is doing in essence what you want. I would not be surprised if there are some gotchas to look out for using eval like this, and I have left the warning message from the execution in place, but it might give you something to think about.

        $ cat
        #!/usr/bin/perl
        use strict;
        use warnings;
        use Data::Dumper;

        my @foo = ( qw/this is the sample line/ );
        my @bar = ( 1,2,4 );
        print Dumper(\@foo);

        my $barstr = join(',',@bar);
        print "barstr: $barstr\n";

        my @line = @foo[@bar];
        print "@line\n";

        my @line2 = @foo[ eval $barstr ];
        print "@line2\n";

        exit;
        __END__

        $ ./
        Scalar value @foo[ eval $barstr ] better written as $foo[ eval $barstr ] at ./ line 18.
        $VAR1 = [
                  'this',
                  'is',
                  'the',
                  'sample',
                  'line'
                ];
        barstr: 1,2,4
        is the line
        is the line

    Update: sanitized my command line details.
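
One gotcha with the eval approach is that a string eval runs whatever code the string contains, so it is risky if the index string comes from outside the program. A minimal sketch of the same slice without eval, reusing the sample values above; the digit check is an added assumption about what counts as a valid index, not something from the original post.

```perl
use strict;
use warnings;

my @foo    = qw/this is the sample line/;
my $barstr = '1,2,4';

# Keep only plain non-negative integers, then slice with the resulting list.
my @indices = grep { /^\d+$/ } split /,/, $barstr;
my @line    = @foo[@indices];

print "@line\n";
```

Because the slice receives a real list of numbers, the "Scalar value ... better written as" warning does not appear either.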

    The answer to the question "Can we do this?" is always an emphatic "Yes!" Just give me enough time and money.
Re: Extracting Columns from line
by hdb (Monsignor) on Oct 23, 2013 at 07:35 UTC

    You can use map and grep to good effect here:

        use strict;
        use warnings;

        open my $IN, "sample.txt" or die $!;
        my $header = <$IN>;
        my %sampleID = map { /(.*?)\t/; $1 => 1 } <$IN>; # store desired columns
        close($IN);

        open $IN, "sampleValue.txt" or die $!;
        $header = <$IN>;
        chomp $header; # otherwise the last column name keeps its newline and never matches
        my @samples = split /\t/, $header;
        my @cols = grep { exists $sampleID{$samples[$_]} } 0..$#samples; # store indices of desired columns

        while(<$IN>){
            chomp;
            my @line = (split /\t/)[@cols]; # pick desired columns using array slice
            print join( "\t", @line ), "\n";
        }
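
A self-contained sketch of the same grep-then-slice idea, for trying it without the real files. The two strings stand in for sample.txt and sampleValue.txt and their contents are invented; the variable names (%wanted, @names, @rows) are also only illustrative.

```perl
use strict;
use warnings;

# Made-up stand-ins for sample.txt and sampleValue.txt, so the sketch runs as is.
my $sample_txt = "ID\tSubtype\nS2\tLumA\nS4\tBasal\n";
my $value_txt  = "gene\tS1\tS2\tS3\tS4\n"
               . "g1\t11\t12\t13\t14\n"
               . "g2\t21\t22\t23\t24\n";

# Collect the wanted sample IDs (first field of each line after the header).
open my $ids_fh, '<', \$sample_txt or die "Cannot open sample list: $!";
my $skip = <$ids_fh>;
my %wanted;
while ( my $line = <$ids_fh> ) {
    chomp $line;
    my ($id) = split /\t/, $line;
    $wanted{$id} = 1;
}
close $ids_fh;

# Work out the wanted column indices once, from the chomped header.
open my $val_fh, '<', \$value_txt or die "Cannot open value matrix: $!";
chomp( my $header = <$val_fh> );
my @names = split /\t/, $header;
my @cols  = grep { exists $wanted{ $names[$_] } } 0 .. $#names;

# Slice every data row with the same index list.
my @rows;
while ( my $line = <$val_fh> ) {
    chomp $line;
    push @rows, join "\t", ( split /\t/, $line )[@cols];
}
close $val_fh;

print "$_\n" for @rows;
```

Computing @cols once from the header means each of the 30000 data lines costs only a split and a slice, with no hash lookups per line.
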
Re: Extracting Columns from line
by Lennotoecom (Pilgrim) on Oct 23, 2013 at 05:55 UTC
    as far as I got the task
    if file1:

        A   crap
        B   crap
        C   crap
        D   crap
        E   crap

    and file2:

        A   crap  B   C   crap  D   crap  crap  E
        1   2     3   4   5     6   7     8     9
        11  12    13  14  15    16  17    18    19
        21  22    23  24  25    26  27    28    29

        open IN, "<file1" or die $!;
        while(<IN>){
            $sID{$1} = 1 if /^(\w+)\t/;
        }
        close IN;
        open IN, "<file2" or die $!;
        map {$i++; push @cols, $i-1 if exists $sID{$_}} split(/\t|$/, <IN>);
        while(@a = split /\t|$/, <IN>){
            print join "\t", @a[@cols],"\n";
        }
        close IN;

    would give you

        1   3   4   6   9
        11  13  14  16  19
        21  23  24  26  29

      Hello Lennotoecom

      Good effort, but a couple of things. Most importantly, using the mode in the second argument of a 2-argument open is right out.
      Use the 3-arg open:

          open IN, '<', 'filepath' or die $!;

      While there is much temptation to predict knowledge of data not provided, don't.

      Within file1 all of the tab delimited headers are required. There is no 'crap' in file1. This means that the hash read in from file1 is the combination of keys needed for the slice in file2.

      So you do not need to re-write input from file1. Also, the map in your file2 read, which is used to extract the keys for the slice, is unnecessary. Well, the part which increments the numbers for the keys is. The required column headers already exist as the hash keys read in from file1. As for actually filling the data into a hash (the rest of the map), you may be on the right lines.

      Your split appears malformed. The final /? may also be unnecessary, as the split operator defaults to splitting on whitespace (including newline) and to operating on the special variable $_.

      so to read in the cols for use against file2 something like

          #!/usr/bin/perl
          use warnings;
          use strict;

          open my $wantcols, '<', '/path/to/file1' or die $!;
          my @cols;
          while(<$wantcols>){
              push @cols, split;
          }
          close $wantcols;

      Then to extract through file2, just read through and load the rows

          my @valarray = [ @cols ]; # construct wanted headers
          open my $datavalues, '<', '/path/to/file2' or die $!;
          # push rows of wanted columns onto table
          while(<$datavalues>){
              push @valarray, [ (split)[@cols] ];
          }
          close $datavalues;

      Using split (acting on its defaults) within an array constructor allows us to treat the split line as a list slice, so we can just load the rows as array references within valarray. This also assumes split operates on the special variable at this level of misanthropy.

      hmm, ok, surprised myself here. however important to note, i have made the assumption that the column headers are numbers, but this is mentioned in op post. And somehow relieved the requirement for using hashes at all. Which is fine until a specific data element needs fetching. But for this you can just construct a hash or call an array element as needed.

      To print the table of extracted columns, you now need to print out the array of referenced arrays.

      print map { @{ $valarray[$_] } , $/ } 0..$#valarray;
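
The map above prints each row's fields back to back, with nothing between them. A minimal sketch that joins every row with tabs instead; the table contents are made up, and only the @valarray name is borrowed from the post.

```perl
use strict;
use warnings;

# Hypothetical table: a header row followed by two data rows, as array references.
my @valarray = (
    [ 1,   3,   4 ],
    [ 'a', 'b', 'c' ],
    [ 'd', 'e', 'f' ],
);

# Join each row with tabs so adjacent fields do not run together.
my @lines = map { join( "\t", @$_ ) } @valarray;
print "$_\n" for @lines;
```
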

      by the stars, i hope that compiles! the important lessons here though are do not make up data, and definitely do not proceed with opens where the arguments are on no account less than 3. (unless your long robe is white, with a fair weight of gold trim exquisitely sewn by the handmaidens of ash'kabha)

      edit - s/&lt;/</; re monk advised (tx)

        most importantly, using the mode in the second argument of 2 argument open is right out.
        Why? Does the code run faster with the 3 argument open than the 2 argument open?
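
The usual argument for the three-argument form is safety rather than speed: with two arguments, Perl reads mode characters out of the file name string itself. A minimal sketch of the difference, using a scratch directory and file name invented for the demonstration:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory and file, both made up for this demonstration.
my $dir  = tempdir( CLEANUP => 1 );
my $file = "$dir/data.txt";

open my $out, '>', $file or die "Cannot create $file: $!";
print {$out} "precious data\n";
close $out;

# A file name from an untrusted source that happens to start with '>'.
my $name = ">$file";

# Three-argument form: the name is only ever a name, so this simply fails.
my $opened_3arg     = open( my $in3, '<', $name ) ? 1 : 0;
my $size_after_3arg = ( -s $file ) || 0;

# Two-argument form: the leading '>' is read as a mode, and the file is truncated.
my $opened_2arg = open( my $in2, $name ) ? 1 : 0;
close $in2 if $opened_2arg;
my $size_after_2arg = ( -s $file ) || 0;

print "3-arg open succeeded: $opened_3arg, size now $size_after_3arg\n";
print "2-arg open succeeded: $opened_2arg, size now $size_after_2arg\n";
```

With a literal file name such as "<file1" the two forms behave the same; the difference only shows up once the name is held in a variable.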
