Statistics Data Structure Hash of Arrays, Arrays of Array

Serial_ has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I am looking for some wisdom. I have a table with 40+ columns and 10000+ rows (bioinformaticians lol), and I started programming in Perl a few months ago. I want to write a program that performs some calculations on tables (which in 99% of cases are the output of various software packages). I need to be able to: select the replicates per group; keep the replicates separated (e.g. cond A1/2/3 needs to be tested against Ctrl1/2/3, not against B1/2/3); and perform the calculations element-wise (i.e. a t-test needs to be performed using the first element of the three replicates in A vs. the three replicates of Ctrl). So I thought the most convenient data structure for storing it would be an AoA or HoA. The column names are like "somethingconstant_A1", so the first part of the string is common to every bunch of columns. An example of the dataset can be found here:

https://www.dropbox.com/s/jvpqu8wlvwri8x6/proteinGroups.txt?dl=0

I thought it would be pretty cool to run a program that first creates a reference table with the names of the columns to be taken, the groups and the control. So enough bla bla, let's go to the code.

  
use warnings;
use strict;

chdir "\.\/IN";

my @files = glob "*txt";

foreach (@files) {
    my $infile = "$_";
    print "$infile opened\n";
    open my $fh, "<", $infile or die "can not open '$infile' $infile\n$!\n";

    my @hh;
    while (<$fh>) {
        chomp;
        @hh = split /\t/, $_;
        last;
    }

    close $fh or die "can not close filehandle\n";
    print "$infile closed\n";

    chdir "\.\.";

    print "insert column name used for calculation\ni.e MaxQuant LFQ intensity\n";
    chomp( my $w = <STDIN> );

    my @lfq = grep { /^$w/ } @hh;
    @lfq = map { ( my $foo = $_ ) =~ s/^$w //; $foo } @lfq;

    print "insert name of control without replicate number\n";
    chomp( my $c = <STDIN> );

    my @crl = grep { /^$c/ } @lfq;     # control columns first
    my @res = grep { !/^$c/ } @lfq;    # then everything else
    push @crl, @res;

    print "insert number of groups\n";
    chomp( my $g = <STDIN> );

    print "insert number of replicates\n";
    chomp( my $l = <STDIN> );

    my $length = scalar @lfq;
    if ( $length == $g * $l ) {
        my @out;
        my $nn = join "\t", @crl;
        my @ll = ( 1 .. $g );
        my @dd = ( 1 .. $l );
        my @gg;
        for ( 1 .. $g ) {
            until ( scalar @gg == $_ * $l ) {
                push @gg, $_;
            }
        }
        my @cc;
        for ( 1 .. $length ) { push @cc, $w }

        my $hj = join "\t", @cc;
        my $jj = join "\t", @gg;
        push @out, "$hj\n";
        push @out, "$jj\n";
        unshift @out, "$nn\n";

        my $outfile = "\.\/INFO\/order.txt";
        open my $out, ">", $outfile or die "$!";
        print $out @out;
        print "results printed in $outfile\n";
        close $out or die "can not close filehandle for printing\n$!\n";
    }
    else {
        print "wrong number of groups or replicates\ncheck column names for typos\n";
    }

}

This works without problems. Next is the main program, where the calculations should be performed.

#!/Users/andreafossati/perl5/perlbrew/perls/perl-5.24.0/bin/perl
use warnings;
use strict;
use Data::Dumper;


my $info = "\.\/INFO\/order.txt";

open my $fh, "<", $info or die "can not open 'info' txt, run order pl first\n$!\n";

my (@bb, %kk, $us, @sam);

my $sw = 0;
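while (<$fh>)  {
    chomp;    # strip the newline so the last field on each line is clean

    # First line that does not start with a digit: the sample names.
    if (!/^[1-9]/ && $sw == 0 ) {
        @sam = split /\t/, $_;
        $sw = 1;
        next;
    }

    # Line starting with a digit: the group number of each sample.
    if (/^[1-9]/ && ( @sam ) ) {
        @bb = split /\t/, $_;
        @kk{@sam} = @bb;
        next;
    }

    # Remaining line: the quantification column name, repeated per sample.
    my @ff = uniq( split /\t/, $_ );
    $us    = $ff[0];
}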
close $fh or die "can not close 'info' fh\n$!";

my @m      = reverse sort {$a <=> $b} values %kk;
my $gr     = $m[0];
(@m, @bb)  = ();

print "\n$gr groups found, column used for quantification: $us\n";


print "\ninsert filename to be processed\nneeds to be a tab delimited file in IN folder\n";
chomp (my $filename = <STDIN>) ;

my $in = "\.\/IN\/$filename.txt";

print "\ninsert prot identifier name\n";
chomp (my $w = <STDIN>);

 
open my $fh_2, "<", $in or die "can not open $filename,\n$!\n";

my (@h, @d, @id, %gg ) = ();

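while (<$fh_2>) {
    chomp;
    my %t = ();

    # Header line: the one containing the protein identifier column name.
    if (/.*$w.*/) { @h = split /\t/, $_; next; }

    # Skip anything that comes before the header has been seen.
    next unless @h;

    @d     = split /\t/, $_;
    @t{@h} = @d;
    my $ll = $t{$w};
    push @id, $ll;

    # Collect the quantification columns (those whose name starts with $us).
    my @k;
    for ( keys %kk ) {
        my @cm = grep { /^$us.*/ } keys(%t);
        push @k, @cm;
    }
    @k = uniq(@k);

    # Append this row's value to the array kept for each sample.
    for (@k) {
        my $key = $_;
        ( my $k = $key ) =~ s/^$us.//;
        push @{ $gg{$k} }, $t{$key};
    }
}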

close $fh_2 or die "can not close 'filename' fh\n$!";



sub uniq {
    my %seen;
    grep !$seen{$_}++, @_;
}

This one is fine too, but I am having a hard time taking what I want out of the HoA. For example, to normalize the data I need to calculate the median of each array, then subtract from each element the median of its column plus the mean of the previously calculated medians. I was thinking about using PDL and converting the data to a matrix (AoA) before doing everything else, but I am not that experienced in Perl, and after reading a lot about complex data structures I am rather confused about how to convert from an HoA to an AoA. I also don't know whether a matrix is the right structure for this sort of problem, or whether I should simply use R.

Ideally my output would be a data structure that allows element-wise calculations without losing the order of the elements (every position in every array is a different protein and needs to be compared with the same protein in the other arrays). I also need to be able to separate the 3 controls from the rest. Please note that the dataset provided has just six samples (3+3); usually there are at least 50, so it needs to be something really flexible. I also need to avoid hardcoding them, because otherwise I would have to change them for every run. Sorry for the uber long post!
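A minimal sketch of one way to do this in plain Perl, not a definitive answer. It assumes the normalization is read as "value minus column median plus mean of all column medians", and that control columns can be recognised by a name prefix. The sample names, the values, and the helper names `median`, `normalize`, `split_columns` and `rows_for` are all made up for illustration; `%gg` stands in for the HoA built by the main program.

```perl
use strict;
use warnings;

# Toy hash of arrays: one array per sample column, one position per protein.
# Sample names and values are invented for this example.
my %gg = (
    Ctrl1 => [ 10, 12, 14 ],
    Ctrl2 => [ 11, 13, 15 ],
    A1    => [ 20, 22, 24 ],
    A2    => [ 21, 23, 25 ],
);

# Median of a list of numbers.
sub median {
    my @s = sort { $a <=> $b } @_;
    my $n = @s;
    return $n % 2
        ? $s[ ( $n - 1 ) / 2 ]
        : ( $s[ $n / 2 - 1 ] + $s[ $n / 2 ] ) / 2;
}

# Assumed reading of the normalization: subtract each column's median,
# then add back the mean of all column medians.
sub normalize {
    my ($hoa) = @_;
    my %med  = map { $_ => median( @{ $hoa->{$_} } ) } keys %$hoa;
    my $mean = 0;
    $mean += $_ for values %med;
    $mean /= scalar keys %med;

    my %norm;
    for my $col ( keys %$hoa ) {
        $norm{$col} = [ map { $_ - $med{$col} + $mean } @{ $hoa->{$col} } ];
    }
    return \%norm;
}

# Split column names into controls and the rest, using a name prefix.
sub split_columns {
    my ( $prefix, @names ) = @_;
    my @ctrl = grep { /^\Q$prefix\E/ } @names;
    my @rest = grep { !/^\Q$prefix\E/ } @names;
    return ( \@ctrl, \@rest );
}

# Turn the chosen columns into an array of rows (AoA); row $i is protein $i,
# so the order of the proteins is preserved.
sub rows_for {
    my ( $hoa, @cols ) = @_;
    my $n = @{ $hoa->{ $cols[0] } };
    return [ map { my $i = $_; [ map { $hoa->{$_}[$i] } @cols ] } 0 .. $n - 1 ];
}

my $norm = normalize( \%gg );
my ( $ctrl, $rest ) = split_columns( 'Ctrl', sort keys %gg );
my $ctrl_rows = rows_for( $norm, @$ctrl );
my $rest_rows = rows_for( $norm, @$rest );

# One line per protein: control replicates first, then the other samples.
print join( "\t", @{ $ctrl_rows->[$_] }, @{ $rest_rows->[$_] } ), "\n"
    for 0 .. $#$ctrl_rows;
```

Because each row of the AoA holds the replicates of one protein, an element-wise test can take row `$i` of the control rows and row `$i` of any other group. Nothing is hardcoded apart from the control prefix, which could come from STDIN as in the first script.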


Replies are listed 'Best First'.
Re: Statistics Data Structure Hash of Arrays, Arrays of Array by BrowserUk (Patriarch) on Nov 10, 2016 at 11:05 UTC
I have a table with 40+ columns and 10000+ rows (bioinformaticians lol) and I have started programming in Perl few months ago. I want to write a program for performing some calculation on tables (which are in the 99% of the cases the output from various softwares). I need to able to : select the replicates per group, keep the replicates separated (es. cond A1/2/3 needs to be tested against Ctrl1/2/3 not against B1/2/3) and to perform the calculation element-wise (i.e t-test needs to be performed using the first element in the three replicate in A vs 3 replicate of ctrl so I thought the best convenient data structure would be AoA or HoA. the column names are like "somethingcostant_A1" so the first part of the string is common to every bunch of column. ... {blah blah} ...

Please, don't describe your data; show your data! I.e. post a small sample of the raw input, and the expected output from that sample input.

Anything else leaves us trying to reverse engineer your words -- which are always ambiguous -- and/or your code -- which is wrong, else you wouldn't be posting.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
The enemy of (IT) success is complexity.
In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: Statistics Data Structure Hash of Arrays, Arrays of Array by Serial_ (Novice) on Nov 10, 2016 at 12:32 UTC
Sorry, first question. I am editing my question.
Re^3: Statistics Data Structure Hash of Arrays, Arrays of Array by AnomalousMonk (Archbishop) on Nov 10, 2016 at 14:36 UTC
... I am editing my question

Please see How do I change/delete my post? for site etiquette and protocol regarding such changes.

Update: Please also see the Short, Self Contained, Correct (Compilable), Example discussion.

Give a man a fish: `<%-{-{-{-<`
Re^4: Statistics Data Structure Hash of Arrays, Arrays of Array by Anonymous Monk on Nov 10, 2016 at 23:25 UTC
Re^5: Statistics Data Structure Hash of Arrays, Arrays of Array by AnomalousMonk (Archbishop) on Nov 11, 2016 at 05:05 UTC
Re: Statistics Data Structure Hash of Arrays, Arrays of Array by BrowserUk (Patriarch) on Nov 11, 2016 at 00:51 UTC
Another quick question: What is your purpose for this code?

`my @crl = grep { /^$c/ } @lfq; my @res = grep { !/^$c/ } @lfq; push @crl, @res;`

You carefully filter all the records from @lfq that start with $c into @crl. Then you equally carefully extract all the records from @lfq that don't start with $c into @res. Then you put all those carefully selected records back together in the same array.

Effectively, the result of those 3 expensive lines is the same as `@crl = @lfq;`, apart from the side effect of retaining some records in `@res` that could have been obtained much more efficiently by omitting 2 of the 3 lines!?

It might be better if, instead of showing how you are doing things, you told us what you are trying to do in terms of inputs -- files and console -- and expected output. At the moment it is hard to know where to start to help you, as the code you've posted doesn't make a whole lot of sense and there doesn't appear to be a question.
Re^2: Statistics Data Structure Hash of Arrays, Arrays of Array by etj (Deacon) on May 19, 2022 at 16:16 UTC
It would reorder @lfq so that all the $c-beginning entries go first. Not at all equivalent to a simple copy.
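A small sketch to illustrate the reordering, using made-up column suffixes in which the controls are deliberately not first. The list contents and the `\Q...\E` quoting are assumptions added for the example, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical column suffixes; the controls are scattered on purpose.
my @lfq = qw(A1 A2 Ctrl1 B1 Ctrl2);
my $c   = 'Ctrl';

my @crl = grep { /^\Q$c\E/ } @lfq;     # controls, in their original order
my @res = grep { !/^\Q$c\E/ } @lfq;    # everything else, in original order
push @crl, @res;                       # controls first, then the rest

print "@lfq\n";    # prints: A1 A2 Ctrl1 B1 Ctrl2
print "@crl\n";    # prints: Ctrl1 Ctrl2 A1 A2 B1
```

The two lists hold the same elements, but only `@crl` has the control columns grouped at the front, which is what the later group numbering relies on.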