PerlMonks
Re: How to add column into array from delimited tab file

by Kenosis (Priest)
on Feb 17, 2014 at 19:02 UTC [id://1075203]


in reply to How to add column into array from delimited tab file

The following expresses my understanding of your situation, based upon your original posting:

  • You have multiple, tab-delimited files
  • The first line of each file contains column headers
  • Each file may have a different number of columns
  • The first column of each file is the ID, so can be discarded
  • You want to generate a file for each column (beyond the ID column), for each of the tab-delimited files

If my understanding is correct, the following--which uses a hash of arrays (HoA)--provides one solution:

use strict;
use warnings;

my ( @header, %hash );
my @files = qw/File1.txt File2.txt/;

local $, = "\n";

for my $file (@files) {
    open my $fhIN, '<', $file or die $!;

    while ( my $line = <$fhIN> ) {
        my @columns = split ' ', $line;

        if ( $. == 1 ) {
            @header = @columns;
        }
        else {
            push @{ $hash{ $header[$_] } }, $columns[$_] for 1 .. $#columns;
        }
    }
    close $fhIN;

    for my $i ( 1 .. $#header ) {
        open my $fhOUT, '>', "$file\_$header[$i].txt" or die $!;
        print $fhOUT @{ $hash{ $header[$i] } };
        close $fhOUT;
    }

    undef %hash;
}

Each file in @files is processed. If it's the first line of the file ($. holds the current line number for the filehandle), it's the header line and is saved.

For lines 2 .. n, you'll note the following:

push @{ $hash{ $header[$_] } }, $columns[$_] for 1 .. $#columns;

  • for 1 .. $#columns — starts at the next index after the ID's index and runs to the last index
  • $columns[$_] — column value
  • $header[$_] — column heading, used as the hash key
  • the push generates a HoA, where keys are column names and values are references to lists of column entries
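As a tiny standalone illustration of the hash of arrays that push statement builds (headers and values here are made up):

```perl
use strict;
use warnings;

my @header = qw/ID ColA ColB/;
my %hash;

for my $line ( "id1\t10\t20", "id2\t30\t40" ) {
    my @columns = split /\t/, $line;

    # skip index 0 (the ID); key each value by its column heading
    push @{ $hash{ $header[$_] } }, $columns[$_] for 1 .. $#columns;
}

# %hash now holds:
#   ColA => [ 10, 30 ]
#   ColB => [ 20, 40 ]
print "@{ $hash{ColA} }\n";    # prints "10 30"
```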

The for my $i ( 1 .. $#header ) { ... } loop iterates through all of the column headers except the ID, using them as keys to access the arrays of column values.

A file is created for each column. The naming scheme is the file's name plus the column's name. (If you didn't want files created, it's within this for loop that you can operate on the columns' values.)

You may have noticed the earlier local $, = "\n"; notation. This causes a newline to be placed between the array's elements when they're printed, so each element appears on its own line. To use a different format, just set $, to something else, e.g., "," or "\t".
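A quick standalone sketch of $,'s effect (values made up):

```perl
use strict;
use warnings;

my @values = ( 345.2, 836, 873 );

{
    local $, = "\n";    # newline placed between printed list elements
    print @values;      # each value on its own line, no trailing newline
}
print "\n";

{
    local $, = "\t";    # tab-separated instead
    print @values;
}
print "\n";
```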

Potential issues:

  • Columns may have the same heading (although this is highly unlikely)
  • The heading may contain 'illegal' file characters. If this is possible, a substitution can be performed on them to eliminate the 'offending' characters
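For the second issue, a substitution along these lines would sanitize a heading before it's used in a file name (a sketch; the character class is an assumption about which characters you consider 'illegal'):

```perl
use strict;
use warnings;

my $heading = 'data/S01:R1';    # hypothetical heading with awkward characters

# replace anything that isn't a word character, dot, or dash
( my $safe = $heading ) =~ s/[^\w.-]/_/g;

print "$safe\n";    # prints "data_S01_R1"
```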

Hope this helps!

Re^2: How to add column into array from delimited tab file
by hellohello1 (Sexton) on Feb 18, 2014 at 01:56 UTC
    Hello Ken, thanks for explaining in your reply. It makes a bit more sense to me now!
    Yes to the following:
      • You have multiple, tab-delimited files
      • The first line of each file contains column headers
      • Each file may have a different number of columns
    However, I do want to keep the first column. I have columns that contain dataR(X) (e.g. dataR1, dataR2...dataR28), followed by several columns of links (some rows will be empty), which I also want to keep. So right now, my problem is finding the headers that match dataS0XRx so that I can grab those columns to perform some calculations:
    e.g. first file.txt:

    ID   dataS01R1  dataS01R2  dataS02R1  dataS02R2  Links
    M45  345.2      536        876.12     873        http://..
    M34  836        893        829        83.234
    M72  873        123        342.36     837
    M98  452        934        1237       938        http://..

    ===================================================
    Calculation:
    row2/row2, row3/row2, row4/row2 ... row3400/row2
    row2/row3, row3/row3, row4/row3 ... row3400/row3
    row2/row4, row3/row4 ... row3400/row4

    E.g. dataS01R1 becomes:

    ID   dataS01R1           ..dataS01R02...  Links
    M45  1      (345.2/345.2)                 http://..
    M34  2.42   (836/345.2)
    M72  2.52   (873/345.2)
    M98  1.309  (452/345.2)                   http://..
    M45  0.41   (345.2/836)                   http://..
    M34  1      (836/836)
    M72  1.04   (873/836)
    M98  0.54   (452/836)                     http://..
    .
    .  (loop through rows as denominator)
    .
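If I follow the calculation described above correctly (each row's value divided in turn by every row's value), a sketch for a single column might look like this (data made up; @col stands for one dataS0XR0X column):

```perl
use strict;
use warnings;

my @ids = qw/M45 M34 M72 M98/;
my @col = ( 345.2, 836, 873, 452 );    # one hypothetical dataS01R1 column

# loop through rows as denominator
for my $d ( 0 .. $#col ) {
    for my $n ( 0 .. $#col ) {
        printf "%s\t%.3f\n", $ids[$n], $col[$n] / $col[$d];
    }
}
```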
    and then loop through the columns, print them out, and filter off unwanted rows based on the average coefficient of variation across all dataSXR0X rows (which I will figure out later, after I manage the beginning part). So my problem here: how do I find the column headers matching dataS0XR0X so I can put those columns into arrays for manipulation? Here is the code I wrote initially, before posting to PerlMonks:
    if ($first) {    # if this is the first file, find the column locations
        my $firstline = <CURINFILE>;    # read in the header line
        chomp $firstline;
        my @columns     = split( /\t/, $firstline );
        my $columncount = 0;

        while ( $columncount <= $#columns && !( $columns[$columncount] =~ /ID/ ) ) {
            $columncount++;
        }
        $ID = $columncount;

        while ( $columncount <= $#columns && !( $columns[$columncount] =~ /_dataS(\d+)R/ ) ) {
            $columncount++;
        }
        $intensitydata = $columncount;

        # read in the remainder of the file
        while (<CURINFILE>) {
            # add the id, intensity values to an array
            chomp $_;
            my @templine  = split( /\t/, $_ );
            my @tempratio = ();
            push( @tempratio, $templine[$ID] );
            push( @tempratio, $templine[$intensitydata] );
            print "\nWriting output...";
    I tried this code initially (before changing to the code I posted in my first post), but it doesn't print out anything, so I don't know what went wrong. I am working with large datasets; initially I worked in Excel, but it was too slow and lagged my whole computer when performing calculations, so I decided to try Perl instead, as I read that it is good for manipulating large datasets. However, I am quite new to Perl (just started two months back), so I am not sure if what I am doing is okay. If there are other suggestions, let me know too. I hope my explanation is not confusing. :)

      Have made a few modifications:

      use strict;
      use warnings;

      my ( @header, %hash );
      my @files = qw/File1.txt File2.txt/;

      local $, = "\t";

      for my $file (@files) {
          open my $fhIN, '<', $file or die $!;

          while ( my $line = <$fhIN> ) {
              my @columns = split ' ', $line;

              if ( $. == 1 ) {
                  @header = @columns;
              }
              else {
                  push @{ $hash{ $header[$_] } }, $columns[$_] for 0 .. $#columns;
              }
          }
          close $fhIN;

          for my $key ( keys %hash ) {
              if ( $key =~ /^dataS\d\dR\d$/ ) {
                  print $key, @{ $hash{$key} }, "\n";
              }
          }

          undef %hash;
      }

      All columns are kept. After the script has processed a file's lines, it iterates through the hash keys. Note that a regex attempts to match the heading pattern for the columns you're interested in processing. Now, when there's a match, it just prints the key and the associated list of values.

        I'm sorry, I do not understand this part... a bit lost here:
        for my $key ( keys %hash ) {
            if ( $key =~ /^dataS\d\dR\d$/ ) {
                print $key, @{ $hash{$key} }, "\n";
            }
        }
        I want to be able to manipulate the columns individually (thus attempting arrays), so I am not sure how a hash can help in this case. I tried running it, but there's an error where the script keeps running without stopping. Instead, I attempted matching the column names using a loop, but I run into errors somewhere. Here is my own code, where I used a while loop to find the matching headers:
        my $first = 1;

        for ( my $i = 0 ; $i < $originalfilecount ; $i++ ) {

            # read in the current file
            open CURINFILE, "<$files[$i]" or die "Error couldn't open file $files[$i]\n";
            print "$files[$i]\n";

            if ($first) {    # if this is first file, find column locations
                my $firstline = <CURINFILE>;    # read headerline
                chomp $firstline;
                my @columns     = split( /\t/, $firstline );
                my $columncount = 0;

                # print "$firstline\n";  # check if print headers correctly

                ####### Column Headers for ID, TIME #########
                while ( $columncount <= $#columns && !( $columns[$columncount] =~ /ID/ ) ) {
                    $columncount++;
                }
                $ID = $columncount;

                while ( $columncount <= $#columns && !( $columns[$columncount] =~ /Time/ ) ) {
                    $columncount++;
                }
                $masstimes = $columncount;

                while ( $columncount <= $#columns && !( $columns[$columncount] =~ /Links/ ) ) {
                    $columncount++;
                }
                $Links = $columncount;

                # check if column position is correct (so far it is correct)
                print "ID is at column: $ID\n";             # output = 0
                print "Time is at column: $masstimes\n";    # output = 1
                print "Links is at column: $Links\n";       # output = 33

                # DataR Columns (got ERROR here where I can't run script at all if I add this)
                # while ( $columncount <= $#columns && !( $columns[$columncount] =~ /_data/ ) ) { $columncount++; }

                $columns[$columncount] =~ /_dataS(\d+)R/;
                my $currentReplicateID    = $1;
                my $currentReplicateCount = 1;
                $ctrlStartCol = $columncount++;

                while ( $columncount <= $#columns ) {
                    $columns[$columncount] =~ /_dataS(\d+)R/;
                    my $newReplicateID = $1;

                    if ( $newReplicateID ne $currentReplicateID ) {
                        push( @replicateCount, $currentReplicateCount );
                        $currentReplicateID    = $newReplicateID;
                        $currentReplicateCount = 1;
                    }
                    else {
                        $currentReplicateCount++;
                    }
                    $columncount++;
                }

                # add the last replicate in
                push( @replicateCount, $currentReplicateCount );

                ###### End of Data Column Headers ####

                ####### Read remainder of the file ##############
                while (<CURINFILE>) {
                    # add metabolite ID, MZ, RT to an array
                    chomp $_;
                    my @templine = split( /\t/, $_ );
                    push( @tempratio, $templine[$metabolite] );
                    push( @tempratio, $templine[$masstimes] );
                    push( @tempratio, $templine[$rt] );    # ERROR

                    # add intensities from the samples
                    my $columnIndex = $ctrlStartCol;
                    for ( my $k = 0 ; $k <= $i ; $k++ ) {
                        $columnIndex += $replicateCount[$k];
                    }
                    for ( my $j = 0 ; $j < $replicateCount[ $i + 1 ] ; $j++ ) {
                        push( @tempratio, $templine[ $columnIndex + $j ] );
                    }
                }
            }    # end of main if loop

            close CURINFILE;
        }    # end of main for loop

        ############## Start of output ##################
        print "\nWriting output...";

        # create a new Directory and open output files and print out those values
        # from the hash that meet the filtering criteria
        # filtering criteria: defaults set at pvalue < 0.05, 0.5 < ratio > 1.5. User specified
        mkdir "$pathname" or die "Error couldn't create new Directory";
        open my $OUT1, ">$pathname/Metabolite ID.txt"  or die "error couldn't open output file";
        open my $OUT2, ">$pathname/masstimes.txt"      or die "error couldn't open output file";
        open my $OUT3, ">$pathname/retentiontimes.txt" or die "error couldn't open output file";
        open my $OUT4, ">$pathname/intensitydata.txt"  or die "error couldn't open output file";

        print $OUT1 "$tempratio[0]";
        print $OUT2 "$tempratio[1]";
        print $OUT3 "$tempratio[2]";
        print $OUT4 "$tempratio[3]";

        close $OUT1;
        close $OUT2;
        close $OUT3;
        PS: The output only prints the 1st value for OUT1 to OUT4, instead of all the values in the whole columns... why?? My DataR names are something like this across the columns:
        8899_Neg_Rep01_dataS01R01
        8889_Neg_Rep02_dataS01R02
        8889_Neg_Rep03_dataS01R03
        7499_Neg_Rep01_dataS02R01
        7499_Neg_Rep02_dataS02R02
        7499_Neg_Rep03_dataS02R03
        7709_Neg_Rep01_dataS05R01
        7709_Neg_Rep02_dataS05R02
        7709_Neg_Rep03_dataS05R03
        (and so on...)
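Since these headings carry prefixes, a pattern anchored only at the end would pick them out; a sketch using made-up header strings like the ones above:

```perl
use strict;
use warnings;

my @header = qw/
    ID Time
    8899_Neg_Rep01_dataS01R01
    8889_Neg_Rep02_dataS01R02
    7499_Neg_Rep01_dataS02R01
    Links
/;

for my $i ( 0 .. $#header ) {
    # captures: $1 = sample number, $2 = replicate number
    if ( $header[$i] =~ /_dataS(\d+)R(\d+)$/ ) {
        print "column $i: sample $1, replicate $2\n";
    }
}
```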
        That's why I attempted a while loop to try to match them, but I think something is wrong somewhere? Not sure if it is the right way to do this.
