Population of HoAoA based on file contents

iangibson has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Population of HoAoA based on file contents by state-o-dis-array (Hermit) on May 09, 2012 at 20:57 UTC
So, if I understand correctly, you want your results to look something like: `$data{"HG00553"}[162][16287215] = "0 0" $data{"HG00553"}[162][16287226] = "0 0" ... $data{"HG00638"}[162][16287851] = "1 1" ...` [download] First, I'd suggest you do a little debug work to understand what your code is doing. Specifically: `while (<$in_file>) { print "$_"; chomp; my @snp_bins; if (/^SAMPLE/) { my ( $placeholder, @coords ) = split /,/; foreach my $coord (@coords) { push @snp_bins, int( $coord / 100_000 ); } } else { my ( $id, @snps ) = split /,/; push @individuals, $id; foreach my $individual (@individuals) { print "working on $individual\n"; foreach my $snp (@snps) { my $snp_bin = shift @snp_bins; print "snp_bin $snp_bin\n"; $data{$individual}[ [ $snp_bin ] ] = $snp; } } print "snp_bins: @snp_bins\n"; } }` [download] I think that you'll find some things which you didn't expect. Also, I think that you might be better off using HoHoH instead. I believe that doing: `$data{$individual}[162] = "something"` [download] will result in: `$data{$individual}[0..162]` [download] even though you are only concerned with the former. So it would probably be better to: `$data{$individual}{162}{16287215} = "0 0"` [download]	[reply] [d/l] [select]
Re: Population of HoAoA based on file contents by aaron_baugher (Curate) on May 09, 2012 at 22:10 UTC
One quick note: since you only have one header line, and you know it's the first line, there's no need to check for it every time through your loop. Read the header line, process it, and then start your loop on the second line. I'll assume that you really do want a HoAoA, despite the large empty areas it may have within it. In that case, I'd start by reading that header line and making an array of two-element arrays, the first element containing the first part (here 162), and the second element containing the remainder (87215 for the first one). Then as you go through the rest of the lines, you can easily step through this array to get the indexes you'll need for your HoAoA. When you start having indexes within indexes, that's when it gets mind-stretching, but fun. Something like this: #!/usr/bin/env perl use Modern::Perl; my %data; # HoAoA holding final data structure my @indexes; # array of indexes into %data based on coord +inates my $header_line = <DATA>; # some of these steps could be combined; chomp $header_line; # including them for clarity my @headers = split ',', $header_line; shift @headers; # throw away SAMPLE for my $n (@headers){ my $x = $n/100_000; my $y = $n % 100_000; push @indexes, [ $x, $y ]; } while(<DATA>){ chomp; my( $id, @bits ) = split ','; for my $i (0..@bits-1){ $data{$id}[$indexes[$i][0]][$indexes[$i][1]] = $bits[$i]; } } say "Expecting 0 0: ", $data{HG00553}[162][87365]; say "Expecting 1 1: ", $data{HG00638}[162][87851]; __DATA__ SAMPLE,16287215,16287226,16287365,16287649,16287784,16287851,16287912 HG00553,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00554,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00637,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00638,0 0,0 0,0 0,0 0,0 0,1 1,0 0 HG00640,0 0,0 0,0 0,0 0,0 0,1 1,0 0 [download] Aaron B. Available for small or large Perl jobs; see my home node.	[reply] [d/l]
Re: Population of HoAoA based on file contents by iangibson (Scribe) on May 14, 2012 at 19:29 UTC
Okay, I think I've made some progress, but I'm still not quite there yet. Here's what I now have: #!/usr/bin/perl use warnings; use strict; use v5.14; use Getopt::Long; use Bio::PopGen::IO; use Bio::PopGen::Statistics; die "need two arguments (i.e. chr cont) at invocation" unless @ARGV == + 2; chomp( my $chr_num = shift ); chomp( my $cont = shift ); open my $out_file, ">", "chr${chr_num}_exome_snps_processed_${cont}_ST +ATS" or die "Can't open output file: $!\n"; open my $in_file, "<", "chr${chr_num}_exome_snps_processed_$cont" or die "Can't open input file: $!\n"; my %data; my @snp_bins; my @individuals; my @all_snps; while (<$in_file>) { chomp; if (/^SAMPLE/) { my ( $placeholder, @coords ) = split /,/; foreach my $coord (@coords) { push @snp_bins, int( $coord / 100_000 ); } } else { my ( $id, @snps ) = split /,/; push @individuals, $id; push @all_snps[$. - 2], join(',', @snps); } } foreach my $individual (@individuals) { foreach my $index ( 0 .. $#snp_bins ) { push( @{ $data{$individual}[ $snp_bins[$index] ] }, $all_snps[ +$index] ); } } close $in_file; [download] But there's still (at least) a problem with the line `push @all_snps[$. - 2], join(',', @snps);` I hope I'm otherwise headed in the right direction..? In regard to what I will do with undefined bins: I will iterate through all the bins, and any that don't have a minimum number of elements simply won't be passed as data to the bioperl popgen stats methods, later on in the program.	[reply] [d/l] [select]
Re^2: Population of HoAoA based on file contents by state-o-dis-array (Hermit) on May 15, 2012 at 14:46 UTC
It seems to me that this would do what you are trying to accomplish. `while (<$in_file>) { chomp; if (/^SAMPLE/) { my ( $placeholder, @coords ) = split /,/; foreach my $coord (@coords) { push @snp_bins, int( $coord / 100_000 ); } } else { my ( $id, @snps ) = split /,/; #need to check here that $#snps == $#snp_bins ? foreach my $index ( 0 .. $#snp_bins ) { push( @{ $data{$id}[ $snp_bins[$index] ] }, $snps[$index] ); } } }` [download] Unless you need them elsewhere, I don't see a reason in your code snippent to store @individuals and @all_snps. The "need to check" comment above is based on my attempt to understand your last statement above. If it's possible that @snps might not have an entry for each @snp_bins, then you'll want this check.	[reply] [d/l]
Re: Population of HoAoA based on file contents by iangibson (Scribe) on May 22, 2012 at 19:02 UTC
Thanks! I rewrote my code following your advice. I used Data::Printer to check it and I think I've finally got my data into the structure that I wanted. In the meantime, I've been trying to work out how to process my data from the HoAoA. What I want to do is pass the first interior array (as a block) for every individual to the module methods, then pass the second array in the same way, and so on for each of the array positions in turn. The goal is to treat each array position as a separate window for the purposes of analysis. Just to make it clear, here is a simplified version of my structure: `my %hash; $hash{key1} = [ [0 0, 0 1, 0 0, 1 1, 0 0, 0 1,], # first block [0 0, 0 0, 0 0, 0 1, 0 1, 0 0,], # second block [0 1, 0 0, 0 0, 0 1, 0 0, 0 0,], # third block ]; $hash{key2} = [ [0 0, 0 0, 0 0, 0 1, 1 1, 0 0,], # first block [0 0, 0 1, 0 0, 0 0, 0 0, 0 0,], # second block [0 0, 1 1, 0 0, 1 1, 0 0, 1 0,], # third block ];` [download] In the above case, I'd want to pass to the module methods (shown in the code below) the arrays labeled '#first block' for both individuals first, then those labeled '#second block', and finally those labeled '#third block', so that I get separate statistics for each genomic region. I have been trying to do this on my actual data. So far, I have produced the following (broken) code: my $id; # ?? foreach my $window ( @{ $data{$id} } ) { # only pass windows with > min number of stats (i.e. >1 for now), an +d # of course we also ignore any 'undef' windows. next unless scalar @{ $data{$id} }[$window] >= 2; foreach my $individual ( keys(%data) ) { my $segsites = Bio::PopGen::Statistics->segregating_sites_count( \ +@{ $data{$id} }[$window] ); my $singletons = Bio::PopGen::Statistics->singleton_count( \@{ $da +ta{$id} }[$window] ); my $pi = Bio::PopGen::Statistics->pi( \@{ $data{$id} }[$window] ); my $theta = Bio::PopGen::Statistics->theta( \@{ $data{$id} }[$wind +ow] ); my $tajima_D = Bio::PopGen::Statistics->tajima_D( \@{ $data{$id} } +[$window] ); my $D_star = Bio::PopGen::Statistics->fu_and_li_D_star( \@{ $data{ +$id} }[$window] ); my $F_star = Bio::PopGen::Statistics->fu_and_li_F_star( \@{ $data{ +$id} }[$window] ); # output follows: say $out_file "Population: $cont\tChromosome: $chr_num"; say $out_file "Seg sites\tSingletons\tPi\tTheta\tTajima's D\tFu & +Li F\tFu & Li D"; # to account for invariant sites, divide stats by window size (i.e +. 100_000) my @stats = qw($segsites $singletons $pi $theta $tajima_D $F_star +$D_star); foreach my $stat (@stats) { print $out_file $stat / 100_000, "\t" } print $out_file "\n"; } } [download] This code produces all manner of errors, and I don't really know if my entire approach is incorrect, or if it's just that I don't understand the Perl syntax for working with complex data structures. The various pages on perldoc haven't really helped much. I'd really appreciate some hints on how to proceed with this.	[reply] [d/l] [select]
Re^2: Population of HoAoA based on file contents by state-o-dis-array (Hermit) on May 22, 2012 at 22:24 UTC
I'm going to deal with just this section, I think once you understand/update what is happening here, everything else you have will work. `my $id; # ?? foreach my $window ( @{ $data{$id} } ) { # only pass windows with > min number of stats (i.e. >1 for now), an +d # of course we also ignore any 'undef' windows. next unless scalar @{ $data{$id} }[$window] >= 2; foreach my $individual ( keys(%data) ) {` [download] First, I think that you understand that having an undefined $id as a hash key won't work. And to fix this, you want to just rearrange your for loops. `foreach my $id ( keys %data ){ foreach my $window ( @{$data{$id}} ){` [download] Just this will get you much further. There is still a problem with the second loop though, at least in how you use $window in the subsequent code. You are attempting to use $window as an index, but it's not an index. Since you have experience using Data::Printer, I suggest going ahead and checking out what $window is. Now, the way to fix things with the fewest number of overall edits is to change the second for loop. `foreach my $id ( keys %data ){ foreach my $window ( 0 .. $#{$data{$id}} ){ next unless scalar @{ $data{$id}[$window] } >= 2;` [download] Then in your method calls, just change those to align with the scalar test above: `\@{ $data{$id}[$window] }` [download] I hope that helps.	[reply] [d/l] [select]
Re^3: Population of HoAoA based on file contents by iangibson (Scribe) on May 24, 2012 at 19:53 UTC
Thanks for your further help on this. I feel like I'm almost there, but not quite.. I made the changes you suggested, but I'm still getting errors starting at the line `next unless scalar @{ $data{$id}[$window] } >= 2;` [download] The error is: `Can't use an undefined value as an ARRAY reference` [download] Is this referring to $window? Regarding the way I was using it above, was $window the outer array that holds all the smaller arrays? Even if this is true, I don't see why I'm getting the error messages I'm getting.	[reply] [d/l] [select]
Re^4: Population of HoAoA based on file contents by state-o-dis-array (Hermit) on May 25, 2012 at 14:35 UTC
Re^5: Population of HoAoA based on file contents by iangibson (Scribe) on May 25, 2012 at 17:05 UTC
Some notes below your chosen depth have not been shown here
Re: Population of HoAoA based on file contents by iangibson (Scribe) on May 10, 2012 at 16:13 UTC
Thanks for the replies. I obviously didn't describe what I'm trying to accomplish sufficiently well, so let me try again. These are actually huge 1000 Genomes files, which I want to do population genetics analysis on using bioperl Bio::PopGen modules. But rather than passing an entire chromosome to get the statistics from (which I have already done: it works but the results are not genetically relevant), I want to process 100 kilobase windows sequentially and get separate statistics for each window. So I am trying to split my data up accordingly. The particular file that I gave a sample of starts at coordinate (i.e. DNA base) number 16,287,215. Only variant bases are listed in the files; this is why there are large gaps between coordinates. I want to start my windows from '1' (as in general, coordinates will vary for different chromosomes), so in this case windows 0 to 161 will be empty, but this doesn't matter as I simply won't pass empty windows to bioperl for analysis. I was imagining that my data would look something like this: `$data{HG00553}[ [], [], [], ...., [0 0,0 0,0 0,0 0,0 0,0 0,0 0,....], +[], [], etc.] $data{HG00554}[ [], [], [], ...., [0 0,0 0,0 0,0 0,0 0,0 0,0 0,....], +[], [], etc.]` [download] ..where the pairs of zeros here would be in array number 162 (still comma-delimited), and the following arrays would also be populated (with data not given in the tiny sample I show), but the earlier arrays would be empty in this specific case. It seems to me that a hash of arrays of arrays would be the best way to go about this, but please correct me if this is wrong. Then subsequently I can pass a particular window number for all individuals to the popgen module to get my stats for that genomic region, then do this for all windows and for each chromosome. I hope this clarifies what I'm trying to do, and further help would be great! Thanks again.	[reply] [d/l]
Re^2: Population of HoAoA based on file contents by aaron_baugher (Curate) on May 11, 2012 at 19:14 UTC
Ok, I think I get it. Instead of `$data{HG00553}[162]` pointing to an array of two-bit elements as I did, you want it to contain a comma-delimited string of those elements? In that case, you can join them together after the array is built by inserting this line into my code: `$data{HG00553}[162] = join ',', @{$data{HG00553}[162]};` [download] That still leaves the question of what you'd want in any undefined elements of the array, though. Aaron B. Available for small or large Perl jobs; see my home node.	[reply] [d/l] [select]
Re^2: Population of HoAoA based on file contents by state-o-dis-array (Hermit) on May 11, 2012 at 14:11 UTC
So you want (?): `$data{HG00554}[162] = (0 0, 0 0, ... , 0 0 )` [download] I'm hoping you've come to understand why you don't want to use the shift in: `$data{$individual}[ [ shift @snp_bins ] ] = $snp;` [download] Assuming so, you could do something like: `for my $index ( 0 .. $#snp_bins ){ push( @{$data{HG00554}[$snp_bins[$index]]}, $snps[$index] ); }` [download] I hope this is helpful, I'm still not convinced I'm understanding what exactly you're needing.	[reply] [d/l] [select]
Re^3: Population of HoAoA based on file contents by iangibson (Scribe) on May 11, 2012 at 17:27 UTC
I think I'm on the right track now. I can see what was wrong with my original code, and I think that the structure you show above should be what I'm looking for. Let me work on it some more (and re-read more carefully the relevant Perl documentation), then if I still have problems I'll either add another comment below or start a fresh post (I won't have much time to work on it until next week, however). Thanks for all your help.	[reply]