Re: How to add column into array from delimited tab file
by NetWallah (Canon) on Feb 17, 2014 at 04:05 UTC
You have the right idea in collecting data into your (hopefully declared earlier) arrays: @data1, @data2, etc.
The problem is that you open and write to the output file WITHIN the read loop, for each row.
You need to move the output-file open and write OUTSIDE the file read loop, something like this:
# Outside the read loop, after closing input file....
mkdir $pathname or die "Error couldn't create new Directory";
open my $OUT1, ">", "$pathname/column.txt" or die "error couldn't open output file";
print $OUT1 "$_\n" for @data1;
close $OUT1;
(Made some adjustments to localize the file handle, and used the "3-argument" open - see other writeups to understand why this is a good idea.)
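Putting the pieces together, here is a minimal self-contained sketch of the collect-then-write flow (the directory name and the two sample rows are made up for illustration):

```perl
use strict;
use warnings;

# Stand-in for the input file: two tab-delimited rows (hypothetical data)
my @rows = ( "M45\t345.2", "M34\t836" );

my @data1;
my $pathname = "ratio_output";   # hypothetical directory name

# "Read loop": only COLLECT values here; no output-file work yet
for my $line (@rows) {
    my @columns = split /\t/, $line;
    push @data1, $columns[1];
}

# Outside the read loop: create the directory and write everything once
mkdir $pathname or die "Error couldn't create new Directory: $!";
open my $OUT1, '>', "$pathname/column.txt"
    or die "error couldn't open output file: $!";
print $OUT1 "$_\n" for @data1;
close $OUT1;
```

Collecting first and writing once means the output file is opened a single time, instead of being re-opened (and clobbered) on every input row.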
What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?
-Larry Wall, 1992
Re: How to add column into array from delimited tab file
by kcott (Archbishop) on Feb 17, 2014 at 04:13 UTC
print OUT1 "$data[1]"
is printing the second element of the array @data which, from your description, contains the value dataR1.
I expect what you want is
print OUT1 "@data1"
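The sigils matter here: $data[1] is element 1 of an array named @data, while "@data1" interpolates the entire (and unrelated) array @data1. A quick illustration with made-up values:

```perl
use strict;
use warnings;

my @data  = ( 'M45', 'dataR1-value' );   # an array named @data
my @data1 = ( 345.2, 836, 873 );         # a different array named @data1

print "$data[1]\n";    # prints: dataR1-value  (second element of @data)
print "@data1\n";      # prints: 345.2 836 873 (all of @data1, space-separated)
```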
Re: How to add column into array from delimited tab file
by Kenosis (Priest) on Feb 17, 2014 at 19:02 UTC
The following expresses my understanding of your situation, based upon your original posting:
- You have multiple, tab-delimited files
- The first line of each file contains column headers
- Each file may have a different number of columns
- The first column of each file is the ID, so can be discarded
- You want to generate a file for each column (beyond the ID column), for each of the tab-delimited files
If my understanding is correct, the following--which uses a hash of arrays (HoA)--provides one solution:
use strict;
use warnings;
my ( @header, %hash );
my @files = qw/File1.txt File2.txt/;
local $, = "\n";
for my $file (@files) {
open my $fhIN, '<', $file or die $!;
while ( my $line = <$fhIN> ) {
my @columns = split ' ', $line;
if ( $. == 1 ) {
@header = @columns;
}
else {
push @{ $hash{ $header[$_] } }, $columns[$_] for 1 .. $#columns;
}
}
close $fhIN;
for my $i ( 1 .. $#header ) {
open my $fhOUT, '>', "$file\_$header[$i].txt" or die $!;
print $fhOUT @{ $hash{ $header[$i] } };
close $fhOUT;
}
undef %hash;
}
Hello ken, thanks for explaining in your reply. It makes a bit more sense to me now!
Yes to the following:
•You have multiple, tab-delimited files
•The first line of each file contains column headers
•Each file may have a different number of columns
However, I do want to keep the first column. I have columns that contain dataR(X) (e.g. dataR1, dataR2 ... dataR28), followed by several columns of links (some rows will be empty), which I also want to keep.
So right now, my problem is finding the headers that match dataS0XRx so that I can grab those columns to perform some calculations:
e.g.
first file.txt:
ID dataS01R1 dataS01R2 dataS02R1 dataS02R2 Links
M45 345.2 536 876.12 873 http://..
M34 836 893 829 83.234
M72 873 123 342.36 837
M98 452 934 1237 938 http://..
===================================================
Calculation:
row2/row2, row3/row2, row4/row2...row3400/row2
row2/row3, row3/row3, row4/row3 ... row3400/row3
row2/row4, row3/row4 ...row3400/row4
E.g. dataS01R1
becomes:
ID dataS01R1 ..dataS01R02... Links
M45 1 (345.2/345.2) http://..
M34 2.42 (836/345.2)
M72 2.52 (873/345.2)
M98 1.309 (452/345.2) http://..
M45 0.41 (345.2/836) http://..
M34 1 (836/836)
M72 1.04 (873/836)
M98 0.54 (452/836) http://..
.
. (loop through rows as denominator)
.
and then loop through the columns, print them out, and filter off unwanted rows based on the average coefficient of variation across all dataSXR0X rows (which I will figure out later, after I manage to figure out the beginning part).
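For one column, the every-row-over-each-denominator scheme described above could be sketched like this (the values are the dataS01R1 numbers from the example file):

```perl
use strict;
use warnings;

my @column = ( 345.2, 836, 873, 452 );   # one dataS01R1 column

# Each row takes a turn as the denominator; every row is divided by it
my @ratio_rows;
for my $denom (@column) {
    push @ratio_rows, [ map { $_ / $denom } @column ];
}

# First block: everything divided by 345.2
printf "%.2f\n", $ratio_rows[0][1];   # 836 / 345.2 -> 2.42
```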
So my problem here is:
How do I find the column headers matching dataS0XR0X so that I can put those columns into arrays for manipulation?
Here is the code which I wrote initially, before posting to PerlMonks:
if($first)
{
#if this is the first file, find the column locations
my $firstline = <CURINFILE>; #read in the header line
chomp $firstline;
my @columns = split(/\t/, $firstline);
my $columncount = 0;
while($columncount <= $#columns && !($columns[$columncount] =~ /ID/))
{
    $columncount++;
}
$ID = $columncount;
while($columncount <= $#columns && !($columns[$columncount] =~ /_dataS(\d+)R/))
{
    $columncount++;
}
$intensitydata = $columncount;
#read in the remainder of the file
while(<CURINFILE>)
{
#add the id, intensity values to an array
chomp $_;
my @templine = split(/\t/,$_);
my @tempratio = ();
push(@tempratio, $templine[$ID]);
push(@tempratio, $templine[$intensitydata]);
print "\nWriting output...";
I tried this code initially (before changing to the code I posted in my first post), but it doesn't print out anything, so I do not know what went wrong.
I am working on large databases. Initially I worked with Excel, but it is too slow and lags my whole computer when performing calculations, so I decided to try Perl instead, as I read that it is good for manipulating large datasets. However, I am quite new to Perl; I just started two months back. So I am not sure if what I am doing is okay. If there are other suggestions, let me know too.
I hope my explanation is not confusing. :)
use strict;
use warnings;
my ( @header, %hash );
my @files = qw/File1.txt File2.txt/;
local $, = "\t";
for my $file (@files) {
open my $fhIN, '<', $file or die $!;
while ( my $line = <$fhIN> ) {
my @columns = split ' ', $line;
if ( $. == 1 ) {
@header = @columns;
}
else {
push @{ $hash{ $header[$_] } }, $columns[$_] for 0 .. $#columns;
}
}
close $fhIN;
for my $key ( keys %hash ) {
if ( $key =~ /^dataS\d\dR\d$/ ) {
print $key, @{ $hash{$key} }, "\n";
}
}
undef %hash;
}
All columns are kept. After the script has processed a file's lines, it iterates through the hash keys. Note that a regex attempts to match the heading pattern for the columns you're interested in processing. For now, when there is a match, it just prints the key and the associated list of values.
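To move from printing to calculating, the same regex can first filter the hash keys into a list of wanted headings. A sketch, with a hard-coded hash standing in for the one the script builds from a file:

```perl
use strict;
use warnings;

# As %hash would look after reading "first file.txt" (sample values)
my %hash = (
    ID        => [ 'M45', 'M34' ],
    dataS01R1 => [ 345.2, 836 ],
    dataS01R2 => [ 536,   893 ],
    Links     => [ 'http://..', '' ],
);

# Keep only the headings matching the dataSxxRx pattern
my @wanted = grep { /^dataS\d\dR\d$/ } sort keys %hash;

for my $key (@wanted) {
    my @column = @{ $hash{$key} };   # this column's values, ready for arithmetic
    print "$key: @column\n";
}
```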
Re: How to add column into array from delimited tab file
by hellohello1 (Sexton) on Feb 17, 2014 at 06:23 UTC
Hi,
I have tried putting print $OUT1 "$_\n" for @data1; as NetWallah suggested. It is able to print the output I want, but on the command line it shows:
Processing data...
Original.txt
Writing output...Use of uninitialized value $metabolite[44] in string at
    G:\Metabolomics\Programming\PERL\Ratio Test\ratio test (updated 12 feb 14).pl
    line 101, <CURINFILE> line 1 (#1)
    (W uninitialized) An undefined value was used as if it were already
    defined. It was interpreted as a "" or a 0, but maybe it was a mistake.
    To suppress this warning assign a defined value to your variables.

    To help you figure out what was undefined, perl will try to tell you the
    name of the variable (if any) that was undefined. In some cases it cannot
    do this, so it also tells you what operation you used the undefined value
    in. Note, however, that perl optimizes your program and the operation
    displayed in the warning may not necessarily appear literally in your
    program. For example, "that $foo" is usually optimized into "that " .
    $foo, and the warning will refer to the concatenation (.) operator,
    even though there is no . in your program.
Writing output...Uncaught exception from user code:
    Error couldn't create new Directory at G:\Metabolomics\Programming\PERL\
    Ratio Test\ratio test (updated 12 feb 14).pl line 98, <CURINFILE> line 2.
    at G:\Metabolomics\Programming\PERL\Ratio Test\ratio test (updated 12 feb 14).pl line 98
Press any key to continue . . .
Is it some kind of error?
In addition, how do I actually loop from dataR1 to dataR3 to put them into different arrays, instead of having to type it out manually, as different files have different numbers of dataRx columns?
I know it has something to do with this line somewhere in the code: =~ /_data(\d+)R/, but I have no idea how.
push(@data1, $columns[1]);
push(@data2, $columns[2]);
push(@data3, $columns[3]);
into something like this after finding the /_dataS(\d+)R/ in the column headers:
push (@data[j], $columns[j]);
I'd appreciate any link related to that which can push me in the right direction. :) Thanks for the help, by the way!
The structure you seem to be looking for is a 2-dimensional array - which, in Perl, is an array of arrays:
my @aoa; # Array of arrays (2-D array)
open my $CURINFILE, "<", $files[$i] or die "Error couldn't open file $files[$i]\n";
print "$files[$i]\n";
while(<$CURINFILE>)
{
    chomp $_;
    push @aoa, [ split /\t/ ];   # Insert an array ref into the array (which is what makes it 2-D)
}
close $CURINFILE;
print "\nWriting output...";
# The first row of @aoa contains the titles, so skip that, and print the rest....
for my $row (@aoa[1..$#aoa]){   # That is a slice of the array, from index 1 till the end
    print $row->[0]."\n";   # $row->[0] contains the contents of the first column (ID)
    # Similarly, $row->[1] is the dataR1 column
}
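Once @aoa is built, the numbered arrays (@data1, @data2, ...) become unnecessary: a column index picks out the same column from every row. A sketch using a literal AoA in place of the file contents:

```perl
use strict;
use warnings;

# As @aoa would look after the read loop above (sample values)
my @aoa = (
    [ 'ID',  'dataR1', 'dataR2' ],   # header row
    [ 'M45', 345.2,    536 ],
    [ 'M34', 836,      893 ],
);

my $j = 1;                                         # which column we want
my @column = map { $_->[$j] } @aoa[ 1 .. $#aoa ];  # column $j from every data row

print "$_\n" for @column;   # prints 345.2 then 836
```

Looping $j from 1 to $#{ $aoa[0] } then visits every data column, however many a given file has.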
What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?
-Larry Wall, 1992
OK. When I put the $row values into arrays, it doesn't print out anything:
for my $row (@aoa[1..$#aoa])
{
@ID = $row->[0];
@data = $row->[1..$#aoa];
}
print $OUT1 "$_\n" for @ID;
print $OUT2 "$_\n" for @data;
The reason for the arrays is so that I can calculate each value over the respective value in the same array, in a loop.
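For what it's worth, two things in the loop above would keep it from accumulating anything: the assignments overwrite @ID and @data on every pass (push appends instead), and $row->[1..$#aoa] is not a valid slice of an array reference - @{$row}[1..$#$row] is. A corrected sketch with sample data:

```perl
use strict;
use warnings;

my @aoa = (
    [ 'ID',  'dataR1', 'dataR2' ],   # header row
    [ 'M45', 345.2,    536 ],
    [ 'M34', 836,      893 ],
);

my ( @ID, @data );
for my $row ( @aoa[ 1 .. $#aoa ] ) {
    push @ID,   $row->[0];               # accumulate, don't overwrite
    push @data, @{$row}[ 1 .. $#$row ];  # slice the row ref, not @aoa
}

print "$_\n" for @ID;     # M45, M34
print "$_\n" for @data;   # 345.2, 536, 836, 893
```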