http://www.perlmonks.org?node_id=971850


in reply to Population of HoAoA based on file contents

Thanks! I rewrote my code following your advice. I used Data::Printer to check it and I think I've finally got my data into the structure that I wanted.

In the meantime, I've been trying to work out how to process my data from the HoAoA. What I want to do is pass the first interior array (as a block) for every individual to the module methods, then pass the second array in the same way, and so on for each of the array positions in turn. The goal is to treat each array position as a separate window for the purposes of analysis. Just to make it clear, here is a simplified version of my structure:

my %hash; $hash{key1} = [ [0 0, 0 1, 0 0, 1 1, 0 0, 0 1,], # first block [0 0, 0 0, 0 0, 0 1, 0 1, 0 0,], # second block [0 1, 0 0, 0 0, 0 1, 0 0, 0 0,], # third block ]; $hash{key2} = [ [0 0, 0 0, 0 0, 0 1, 1 1, 0 0,], # first block [0 0, 0 1, 0 0, 0 0, 0 0, 0 0,], # second block [0 0, 1 1, 0 0, 1 1, 0 0, 1 0,], # third block ];

In the above case, I'd want to pass to the module methods (shown in the code below) the arrays labeled '#first block' for both individuals first, then those labeled '#second block', and finally those labeled '#third block', so that I get separate statistics for each genomic region.

I have been trying to do this on my actual data. So far, I have produced the following (broken) code:

my $id; # ?? foreach my $window ( @{ $data{$id} } ) { # only pass windows with > min number of stats (i.e. >1 for now), an +d # of course we also ignore any 'undef' windows. next unless scalar @{ $data{$id} }[$window] >= 2; foreach my $individual ( keys(%data) ) { my $segsites = Bio::PopGen::Statistics->segregating_sites_count( \ +@{ $data{$id} }[$window] ); my $singletons = Bio::PopGen::Statistics->singleton_count( \@{ $da +ta{$id} }[$window] ); my $pi = Bio::PopGen::Statistics->pi( \@{ $data{$id} }[$window] ); my $theta = Bio::PopGen::Statistics->theta( \@{ $data{$id} }[$wind +ow] ); my $tajima_D = Bio::PopGen::Statistics->tajima_D( \@{ $data{$id} } +[$window] ); my $D_star = Bio::PopGen::Statistics->fu_and_li_D_star( \@{ $data{ +$id} }[$window] ); my $F_star = Bio::PopGen::Statistics->fu_and_li_F_star( \@{ $data{ +$id} }[$window] ); # output follows: say $out_file "Population: $cont\tChromosome: $chr_num"; say $out_file "Seg sites\tSingletons\tPi\tTheta\tTajima's D\tFu & +Li F*\tFu & Li D*"; # to account for invariant sites, divide stats by window size (i.e +. 100_000) my @stats = qw($segsites $singletons $pi $theta $tajima_D $F_star +$D_star); foreach my $stat (@stats) { print $out_file $stat / 100_000, "\t" } print $out_file "\n"; } }

This code produces all manner of errors, and I don't really know if my entire approach is incorrect, or if it's just that I don't understand the Perl syntax for working with complex data structures. The various pages on perldoc haven't really helped much. I'd really appreciate some hints on how to proceed with this.

Replies are listed 'Best First'.
Re^2: Population of HoAoA based on file contents
by state-o-dis-array (Hermit) on May 22, 2012 at 22:24 UTC
    I'm going to deal with just this section, I think once you understand/update what is happening here, everything else you have will work.
    my $id; # ?? foreach my $window ( @{ $data{$id} } ) { # only pass windows with > min number of stats (i.e. >1 for now), an +d # of course we also ignore any 'undef' windows. next unless scalar @{ $data{$id} }[$window] >= 2; foreach my $individual ( keys(%data) ) {
    First, I think that you understand that having an undefined $id as a hash key won't work. And to fix this, you want to just rearrange your for loops.
    foreach my $id ( keys %data ){ foreach my $window ( @{$data{$id}} ){

    Just this will get you much further. There is still a problem with the second loop though, at least in how you use $window in the subsequent code. You are attempting to use $window as an index, but it's not an index. Since you have experience using Data::Printer, I suggest going ahead and checking out what $window is.

    Now, the way to fix things with the fewest number of overall edits is to change the second for loop.

    foreach my $id ( keys %data ){ foreach my $window ( 0 .. $#{$data{$id}} ){ next unless scalar @{ $data{$id}[$window] } >= 2;
    Then in your method calls, just change those to align with the scalar test above:
    \@{ $data{$id}[$window] }
    I hope that helps.

      Thanks for your further help on this. I feel like I'm almost there, but not quite.. I made the changes you suggested, but I'm still getting errors starting at the line

      next unless scalar @{ $data{$id}[$window] } >= 2;

      The error is:

      Can't use an undefined value as an ARRAY reference

      Is this referring to $window? Regarding the way I was using it above, was $window the outer array that holds all the smaller arrays? Even if this is true, I don't see why I'm getting the error messages I'm getting.

        Hmm, based on your expectation for your data, I'm guessing what you probably need is to add:
        next unless ref( $data{$id}[$window] ) eq "ARRAY";
        just before the line getting the error.