http://www.perlmonks.org?node_id=1075252


in reply to Re: How to add column into array from delimited tab file
in thread How to add column into array from delimited tab file

Hello ken, thanks for explaining in your reply. It makes a bit more sense to me now!
Yes to the following:
•You have multiple, tab-delimited files
•The first line of each file contains column headers
•Each file may have a different number of columns
However, I do want to keep the first column. I have columns that contain dataR(X) (e.g. dataR1, dataR2 ... dataR28), followed by several link columns (some rows will be empty), which I also want to keep. So right now, my problem is finding the headers that match dataS0XRx so that I can grab those columns to perform some calculations:
e.g. first file.txt:

ID    dataS01R1  dataS01R2  dataS02R1  dataS02R2  Links
M45   345.2      536        876.12     873        http://..
M34   836        893        829        83.234
M72   873        123        342.36     837
M98   452        934        1237       938        http://..

===================================================
Calculation:
row2/row2, row3/row2, row4/row2 ... row3400/row2
row2/row3, row3/row3, row4/row3 ... row3400/row3
row2/row4, row3/row4 ... row3400/row4

E.g. dataS01R1 becomes:

ID    dataS01R1            ..dataS01R2...   Links
M45   1      (345.2/345.2)                  http://..
M34   2.42   (836/345.2)
M72   2.52   (873/345.2)
M98   1.309  (452/345.2)                    http://..
M45   0.41   (345.2/836)                    http://..
M34   1      (836/836)
M72   1.04   (873/836)
M98   0.54   (452/836)                      http://..
.
.  (loop through rows as denominator)
.
and then loop through the columns, print them out, and filter off unwanted rows based on the average coefficient of variation across all dataSXR0X rows (which I will figure out later, after I manage the beginning part). So my problem here is: how do I find the column headers matching dataS0XR0X so that I can put those columns into arrays for manipulation (a rough sketch of what I mean is at the end of this post)? Here is the code I wrote initially, before posting on PerlMonks:
if($first) {    #if this is the first file, find the column locations
    my $firstline = <CURINFILE>;    #read in the header line
    chomp $firstline;
    my @columns = split(/\t/, $firstline);
    my $columncount = 0;
    while($columncount <= $#columns && !($columns[$columncount] =~ /ID/)) {
        $columncount++;
    }
    $ID = $columncount;
    while($columncount <= $#columns && !(($columns[$columncount] =~ /_dataS(\d+)R/))) {
        $columncount++;
    }
    $intensitydata = $columncount;

    #read in the remainder of the file
    while(<CURINFILE>) {
        #add the id, intensity values to an array
        chomp $_;
        my @templine  = split(/\t/, $_);
        my @tempratio = ();
        push(@tempratio, $templine[$ID]);
        push(@tempratio, $templine[$intensitydata]);
        print "\nWriting output...";
I tried this code initially (before changing to the code I posted in my first post), but it doesn't print out anything, so I do not know what went wrong. I am working on large datasets; initially I used Excel, but it is too slow and lags my whole computer when performing calculations, so I decided to try Perl instead, as I read that it is good for manipulating large datasets. However, I am quite new to Perl, having started just two months back, so I am not sure if what I am doing is okay. If there are other suggestions, let me know too. I hope my explanation is not confusing. :)
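Here is the rough sketch I mean (the file name is just a placeholder, and it assumes the simplified dataS0XR0X headers from the example above rather than my real headers):

use strict;
use warnings;

# Rough sketch only: find the dataS0XR0X columns from the header line,
# collect each matching column into its own array, then divide one
# column row-by-row by each row in turn (row2/row2, row3/row2, ...).
my $file = 'first file.txt';    # placeholder name
open my $fh, '<', $file or die "Can't open $file: $!";

chomp( my $header = <$fh> );
my @columns = split /\t/, $header;

# indices of columns whose heading contains the dataS..R.. pattern
my @data_idx = grep { $columns[$_] =~ /dataS\d+R\d+/ } 0 .. $#columns;

my @ids;          # first column (M45, M34, ...)
my @data_cols;    # $data_cols[$n] holds the values of the n-th matching column

while ( my $line = <$fh> ) {
    chomp $line;
    my @fields = split /\t/, $line;
    push @ids, $fields[0];
    push @{ $data_cols[$_] }, $fields[ $data_idx[$_] ] for 0 .. $#data_idx;
}
close $fh;

# ratio step for the first matching column only, as an example
my $col = $data_cols[0];
for my $denom ( 0 .. $#$col ) {
    for my $row ( 0 .. $#$col ) {
        printf "%s\t%.3f\n", $ids[$row], $col->[$row] / $col->[$denom];
    }
}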

Replies are listed 'Best First'.
Re^3: How to add column into array from delimited tab file
by Kenosis (Priest) on Feb 18, 2014 at 06:36 UTC

    Have made a few modifications:

    use strict;
    use warnings;

    my ( @header, %hash );
    my @files = qw/File1.txt File2.txt/;

    local $, = "\t";

    for my $file (@files) {
        open my $fhIN, '<', $file or die $!;

        while ( my $line = <$fhIN> ) {
            my @columns = split ' ', $line;
            if ( $. == 1 ) {
                @header = @columns;
            }
            else {
                push @{ $hash{ $header[$_] } }, $columns[$_] for 0 .. $#columns;
            }
        }
        close $fhIN;

        for my $key ( keys %hash ) {
            if ( $key =~ /^dataS\d\dR\d$/ ) {
                print $key, @{ $hash{$key} }, "\n";
            }
        }

        undef %hash;
    }

    All columns are kept. After the script has processed a file's lines, it iterates through the hash keys. Note that a regex attempts to match the heading pattern for the columns you're interested in processing. Now, when there is a match, it just prints the key and the associated list of values.
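    As one possible bridge to your ratio calculation, here's a sketch of pulling a single column out of the hash after a file has been read (the key names come from your sample data):

    # Sketch only: after a file's lines have been read into %hash (and
    # before undef %hash), one column can be pulled out and divided
    # row-by-row, as in your row2/row2, row3/row2, ... example.
    my @ids    = @{ $hash{'ID'} };          # the ID column is kept as well
    my @values = @{ $hash{'dataS01R1'} };   # example key from your sample data

    for my $denom ( 0 .. $#values ) {
        for my $row ( 0 .. $#values ) {
            printf "%s\t%.3f\n", $ids[$row], $values[$row] / $values[$denom];
        }
    }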

      I'm sorry, I do not understand this part ... a bit lost here:
      for my $key ( keys %hash ) {
          if ( $key =~ /^dataS\d\dR\d$/ ) {
              print $key, @{ $hash{$key} }, "\n";
          }
      }
      I want to be able to manipulate the columns individually (hence attempting arrays), so I am not sure how a hash can help in this case. I tried running it, but there's an error where the script keeps running without stopping. Instead, I attempted to match the column names using a loop, but I run into errors somewhere. Here is my own code, where I used while loops to find the matching headers:
      my $first = 1;
      for(my $i = 0; $i < $originalfilecount; $i++) {
          #read in the current file
          open CURINFILE, "<$files[$i]" or die "Error couldn't open file $files[$i]\n";
          print "$files[$i]\n";

          if($first) {    #if this is first file, find column locations
              my $firstline = <CURINFILE>;    #read headerline
              chomp $firstline;
              my @columns = split (/\t/, $firstline);
              my $columncount = 0;
              # print "$firstline\n";    #check if print headers correctly

              ####### Column Headers for ID, TIME #########
              while ($columncount <= $#columns && !($columns[$columncount] =~ /ID/)) {
                  $columncount++;
              }
              $ID = $columncount;

              while ($columncount <= $#columns && !($columns[$columncount] =~ /Time/)) {
                  $columncount++;
              }
              $masstimes = $columncount;

              while ($columncount <= $#columns && !($columns[$columncount] =~ /Links/)) {
                  $columncount++;
              }
              $Links = $columncount;

              #check if column position is correct (so far it is correct)
              print "ID is at column: $ID\n";             #output = 0
              print "Time is at column: $masstimes\n";    #output = 1
              print "Links is at column: $Links\n";       #output = 33

              #DataR Columns (got ERROR here where I can't run script at all if I add this)
              # while($columncount <= $#columns && !(($columns[$columncount] =~ /_data/))) { $columncount++; }
              $columns[$columncount] =~ /_dataS(\d+)R/;
              my $currentReplicateID = $1;
              my $currentReplicateCount = 1;
              $ctrlStartCol = $columncount++;

              while($columncount <= $#columns) {
                  $columns[$columncount] =~ /_dataS(\d+)R/;
                  my $newReplicateID = $1;
                  if($newReplicateID ne $currentReplicateID) {
                      push(@replicateCount, $currentReplicateCount);
                      $currentReplicateID = $newReplicateID;
                      $currentReplicateCount = 1;
                  }
                  else {
                      $currentReplicateCount++;
                  }
                  $columncount++;
              }
              #add the last replicate in
              push(@replicateCount, $currentReplicateCount);
              ###### End of Data Column Headers ####

              ####### Read remainder of the file ##############
              while (<CURINFILE>) {
                  #add metabolite ID, MZ, RT to an array
                  chomp $_;
                  my @templine = split (/\t/, $_);
                  push(@tempratio, $templine[$metabolite]);
                  push(@tempratio, $templine[$masstimes]);
                  push(@tempratio, $templine[$rt]);    #ERROR

                  #add intensities from the samples
                  my $columnIndex = $ctrlStartCol;
                  for(my $k = 0; $k <= $i; $k++) {
                      $columnIndex += $replicateCount[$k];
                  }
                  for(my $j = 0; $j < $replicateCount[$i+1]; $j++) {
                      push(@tempratio, $templine[$columnIndex+$j]);
                  }
              }    #end of reading the remainder of the file
          }    # end of main if loop
          close CURINFILE;
      }    #end of main for loop

      ############## Start of output ##################
      print "\nWriting output...";
      #create a new Directory and open output files and print out those values from the hash that meet the filtering criteria
      #filtering criteria: defaults set at pvalue < 0.05, 0.5 < ratio > 1.5. User specified
      mkdir "$pathname" or die "Error couldn't create new Directory";
      open my $OUT1, ">$pathname/Metabolite ID.txt" or die "error couldn't open output file";
      open my $OUT2, ">$pathname/masstimes.txt" or die "error couldn't open output file";
      open my $OUT3, ">$pathname/retentiontimes.txt" or die "error couldn't open output file";
      open my $OUT4, ">$pathname/intensitydata.txt" or die "error couldn't open output file";
      print $OUT1 "$tempratio[0]";
      print $OUT2 "$tempratio[1]";
      print $OUT3 "$tempratio[2]";
      print $OUT4 "$tempratio[3]";
      close $OUT1;
      close $OUT2;
      close $OUT3;
      PS: The output only prints the 1st value for each of OUT1 to OUT4 instead of all the values in the whole columns ... why? My DataR names are something like this across the columns:
      8899_Neg_Rep01_dataS01R01
      8889_Neg_Rep02_dataS01R02
      8889_Neg_Rep03_dataS01R03
      7499_Neg_Rep01_dataS02R01
      7499_Neg_Rep02_dataS02R02
      7499_Neg_Rep03_dataS02R03
      7709_Neg_Rep01_dataS05R01
      7709_Neg_Rep02_dataS05R02
      7709_Neg_Rep03_dataS05R03
      (and so on...)
      That's why I attempted a while loop to try to match them, but I think something is wrong somewhere. Not sure if it is the right way to do this.
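      Could the problem be that my real headings have that prefix in front of dataS..R.., so an anchored pattern like /^dataS\d\dR\d$/ never matches them? Here is a rough sketch of the looser match I was thinking of trying (using the @columns array from my code above):

      # Rough sketch only: match headings that end in _dataS..R.., and
      # capture the sample and replicate numbers while doing so.
      for my $heading (@columns) {
          if ( $heading =~ /_dataS(\d+)R(\d+)$/ ) {
              my ( $sample, $replicate ) = ( $1, $2 );
              print "matched $heading: sample $sample, replicate $replicate\n";
          }
      }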

        I want to be able to manipulate the columns individually (hence attempting arrays), so I am not sure how a hash can help in this case.

        In the case of using a hash of arrays (HoA), you can think of the arrays as named, with the hash keys acting as the names, although Perl has no such construct.
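        To make that concrete, here's a tiny sketch of a hash of arrays built by hand, using values from the sample file below:

        # Sketch only: a hash of arrays, where each key names one column's values.
        my %hash = (
            dataS01R1 => [ 324, 234, 542.12 ],
            dataS01R2 => [ 445, 654, 764    ],
        );
        push @{ $hash{dataS01R1} }, 999;     # append an example value to one "named" array
        print "@{ $hash{dataS01R2} }\n";     # prints: 445 654 764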

        For example, to run the modified script that only prints the headers and values, the following single file was used:

        ID  dataS01R1  dataS01R2  dataS02R1
        1   324        445        654
        2   234        654        768.5
        3   542.12     764        98.2

        And here is the script's output:

        dataS02R1	654	768.5	98.2
        dataS01R2	445	654	764
        dataS01R1	324	234	542.12

        This shows the headers (the keys) followed by the list elements.

        The regex /^dataS\d\dR\d$/ matches only the column headings listed above, which have the pattern dataSnnRn, where "n" signifies a digit 0-9.

        Thus, the notation @{ $hash{$key} } represents the array of elements under the heading contained in $key.

        If you want to process a specific column, use eq instead of a regex, e.g., if ( $key eq 'dataS01R1' ) { ....
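        For instance, a small sketch of that eq variant, copying the column into a plain array so it can be manipulated on its own ('dataS01R1' is just the heading from the sample file):

        # Sketch only: copy one column out of the hash of arrays into a
        # plain array for individual manipulation.
        if ( exists $hash{'dataS01R1'} ) {
            my @column = @{ $hash{'dataS01R1'} };    # e.g. (324, 234, 542.12)
            print "first value: $column[0]\n";
            print "all values:  @column\n";
        }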

        This may now lead back to the original, unmodified script, which generates a set of files for each processed table; you can modify that script to fit your needs.

        I tried running it, but there's an error where the script keeps running without stopping.

        The script executes locally as expected, given the datasets you've shared. Without knowing more about your data, it's difficult to troubleshoot the problem you're experiencing.

        Hope this helps!