Dear Monks,
I am looking for some wisdom,
I have a table with 40+ columns and 10000+ rows (bioinformaticians lol)
and I have started programming in Perl few months ago.
I want to write a program for performing some calculation on tables (which are in the 99% of the cases the output from various softwares).
I need to able to : select the replicates per group, keep the replicates separated (es. cond A1/2/3 needs to be tested against Ctrl1/2/3 not against B1/2/3) and to perform the calculation element-wise (i.e t-test needs to be performed using the first element in the three replicate in A vs 3 replicate of ctrl so I thought the most convenient data structure for storing it would be AoA or HoA.
the column names are like "somethingcostant_A1" so the first part of the string is common to every bunch of column.
An example of the dataset could be found here.
https://www.dropbox.com/s/jvpqu8wlvwri8x6/proteinGroups.txt?dl=0
I thought it would be pretty cool to be able to run a program who's creating a ref table first with the name of what it needs to be taken, groups and control.
so enough bla bla let's go to the code.
use warnings;
use strict;
chdir "\.\/IN";
my @files = glob "*txt";
foreach (@files) {
my $infile = "$_";
print "$infile opened\n";
open my $fh, "<", $infile or die "can not open '$infile' $infile\n
+$!\n";
my @hh;
while (<$fh>) {
chomp;
@hh = split /\t/, $_;
last;
}
close $fh or die "can not close filehandle\n";
print "$infile closed\n";
chdir "\.\.";
print
"insert column name used for calculation\ni.e MaxQuant LFQ inten
+sity\n";
chomp( my $w = <STDIN> );
my @lfq = grep { /^$w/ } @hh;
@lfq = map { ( my $foo = $_ ) =~ s/^$w //; $foo } @lfq;
print "insert name of control without replicate number\n";
chomp( my $c = <STDIN> );
my @crl = grep { /^$c*/ } @lfq;
my @res = grep { !/^$c*/ } @lfq;
push @crl, @res;
print "insert number of groups\n";
chomp( my $g = <STDIN> );
print "insert number of replicates\n";
chomp( my $l = <STDIN> );
my $length = scalar @lfq;
if ( $length == $g * $l ) {
my @out;
my $nn = join "\t", @crl;
my @ll = ( 1 .. $g );
my @dd = ( 1 .. $l );
my @gg;
for ( 1 .. $g ) {
until ( scalar @gg == $_ * $l ) {
push @gg, $_;
}
}
my @cc;
for ( 1 .. $length ) { push @cc, $w }
my $hj = join "\t", @cc;
my $jj = join "\t", @gg;
push @out, "$hj\n";
push @out, "$jj\n";
unshift @out, "$nn\n";
my $outfile = "\.\/INFO\/order.txt";
open my $out, ">", $outfile or die "$!";
print $out @out;
print "results printed in $outfile\n";
close $out or die "can not close filehandle for printing\n$!\n
+";
}
else {
print
"wrong number of groups or replicates\ncheck column names for typos\n"
+;
}
}
and this is working no problem, then this is the major program where the calculation should be performed.
#!/Users/andreafossati/perl5/perlbrew/perls/perl-5.24.0/bin/perl
use warnings;
use strict;
use Data::Dumper;
my $info = "\.\/INFO\/order.txt";
open my $fh, "<", $info or die "can not open 'info' txt, run order pl
+first\n$!\n";
my (@bb, %kk, $us, @sam);
my $sw = 0;
while (<$fh>) {
if (!/^[1-9]{1}/ && $sw == 0 )
{ @sam = split /\t/, $_;
$sw = 1;
next;
}
if (/^[1-9]{1}/ && ( @sam ) ) {
@bb = split /\t/, $_;
@kk{@sam} = @bb;
next;
}
my $st = $_;
my @ff = split /\t/, $st;
@ff = uniq(@ff);
$us = $ff[0];
}
close $fh or die "can not close 'info' fh\n$!";
my @m = reverse sort {$a <=> $b} values %kk;
my $gr = $m[0];
(@m, @bb) = ();
print "\n$gr groups found, column used for quantification: $us\n";
print "\ninsert filename to be processed\nneeds to be a tab delimited
+file in IN folder\n";
chomp (my $filename = <STDIN>) ;
my $in = "\.\/IN\/$filename.txt";
print "\ninsert prot identifier name\n";
chomp (my $w = <STDIN>);
open my $fh_2, "<", $in or die "can not open $filename,\n$!\n";
my (@h, @d, @id, %gg ) = ();
while (<$fh_2>) {
chomp;
my %t = ();
if (/.*$w.*/) {@h = split /\t/, $_; next;}
if ( @h ) {
@d = split /\t/, $_;
@t{@h} = @d;
my $ll = $t{$w};
push @id, $ll;
my @k;
for (keys %kk)
{
my @cm = grep { /^$us.*/ } keys(%t);
push @k, @cm;
}
@k = uniq(@k);
for (@k) {
my $key = $_;
(my $k = $key ) =~ s/^$us.//;
push @{$gg{$k}}, $t{$key};
}
} else {next;}
}
close $fh_2 or die "can not close 'filename' fh\n$!";
sub uniq {
my %seen;
grep !$seen{$_}++, @_;
}
and also this one is fine. but I am having hard times taking out the things I want from the HoA, for example for normalizing the data I need to calculate the median per array then subtract per each element the medianof the column plus the mean of the medians previously calculated.
I was thinking about using PDL and converting it to a matrix (AoA) then performing everything, but I am not so expert in Perl and after reading a lot about complex data structure I m kinda confused in how to convert from HoA to AoA and I also don't know if a matrix would be the correct structure for this sort of problem.
Or if I simply use R and whatever.
Ideally my output would be a data structure that allows to perform element-wise calculation without loosing the order of the elements (every position for every array is a different protein and needs to be compared with he same protein in the different arrays), furthermore I also need to be able to separate the 3 control from the rest.
Please note in the dataset provided I have got just six samples (3+3) usually it is at least 50 so it needs to be something really flexible. I also need to not have them hardcoded cause otherwise it will means having to change them in every different runs.
Sorry for the uber long post!