Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Population of HoAoA based on file contents

by iangibson (Scribe)
on May 09, 2012 at 18:58 UTC ( [id://969678]=perlquestion: print w/replies, xml ) Need Help??

iangibson has asked for the wisdom of the Perl Monks concerning the following question:

I'd like to request help from a Monk on how to build a complex data structure, based on a file's contents. I am new to this, and despite having studied various tutorials on this topic, I cannot figure out the correct procedure.

I am trying to put the data from a file into a Hash of Arrays of Arrays, where the individuals (who each have their own line in the file) are the hash keys, and have the data (consisting of comma-separated pairs of zeros and ones) associated with them broken up into 100 kilobase (kb) windows, based on the genomic coordinates that are in the header line of each file. Each 100kb segment will therefore have its own array within a larger array that encompasses the entire data for that individual.

The code I have so far is listed below, followed by a sample of one of the files I need to work with. In this example, the first several interior arrays would be empty (we start with number 162, which all of the data in the few columns listed would go into), but for other files this would not be the case. For example, for each individual (beginning on line 2 of the input file) I want all data for the columns corresponding to header line coordinates between 1-99,999 (if any) to go into the first array, then 100,000-199,999 into the second array, and so on.

Help on this would be most appreciated.

#!/usr/bin/perl use warnings; use strict; use v5.14; die "need two arguments (i.e. chr cont) at invocation" unless @ARGV == + 2; chomp( my $chr_num = shift ); chomp( my $cont = shift ); open my $out_file, ">", "chr${chr_num}_exome_snps_processed_${cont}_ST +ATS" or die "Can't open output file: $!\n"; # Get a list of individuals (will be hash keys later): open my $in_file, "<", "chr${chr_num}_exome_snps_processed_$cont" or die "Can't open input file: $!\n"; my @individuals; my %data; while (<$in_file>) { chomp; my @snp_bins; if (/^SAMPLE/) { my ( $placeholder, @coords ) = split /,/; foreach my $coord (@coords) { push @snp_bins, int( $coord / 100_000 ); } } else { my ( $id, @snps ) = split /,/; push @individuals, $id; foreach my $individual (@individuals) { foreach my $snp (@snps) { $data{$individual}[ [ shift @snp_bins ] ] = $snp; } } } } close $in_file;
## Sample of data file. Each file has hundreds of thousands of columns + and hundreds of rows SAMPLE,16287215,16287226,16287365,16287649,16287784,16287851,16287912 HG00553,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00554,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00637,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00638,0 0,0 0,0 0,0 0,0 0,1 1,0 0 HG00640,0 0,0 0,0 0,0 0,0 0,1 1,0 0

Replies are listed 'Best First'.
Re: Population of HoAoA based on file contents
by state-o-dis-array (Hermit) on May 09, 2012 at 20:57 UTC
    So, if I understand correctly, you want your results to look something like:
    $data{"HG00553"}[162][16287215] = "0 0" $data{"HG00553"}[162][16287226] = "0 0" ... $data{"HG00638"}[162][16287851] = "1 1" ...
    First, I'd suggest you do a little debug work to understand what your code is doing. Specifically:
    while (<$in_file>) { print "$_"; chomp; my @snp_bins; if (/^SAMPLE/) { my ( $placeholder, @coords ) = split /,/; foreach my $coord (@coords) { push @snp_bins, int( $coord / 100_000 ); } } else { my ( $id, @snps ) = split /,/; push @individuals, $id; foreach my $individual (@individuals) { print "working on $individual\n"; foreach my $snp (@snps) { my $snp_bin = shift @snp_bins; print "snp_bin $snp_bin\n"; $data{$individual}[ [ $snp_bin ] ] = $snp; } } print "snp_bins: @snp_bins\n"; } }
    I think that you'll find some things which you didn't expect. Also, I think that you might be better off using HoHoH instead. I believe that doing:
    $data{$individual}[162] = "something"
    will result in:
    $data{$individual}[0..162]
    even though you are only concerned with the former. So it would probably be better to:
    $data{$individual}{162}{16287215} = "0 0"
Re: Population of HoAoA based on file contents
by aaron_baugher (Curate) on May 09, 2012 at 22:10 UTC

    One quick note: since you only have one header line, and you know it's the first line, there's no need to check for it every time through your loop. Read the header line, process it, and then start your loop on the second line.

    I'll assume that you really do want a HoAoA, despite the large empty areas it may have within it. In that case, I'd start by reading that header line and making an array of two-element arrays, the first element containing the first part (here 162), and the second element containing the remainder (87215 for the first one). Then as you go through the rest of the lines, you can easily step through this array to get the indexes you'll need for your HoAoA. When you start having indexes within indexes, that's when it gets mind-stretching, but fun. Something like this:

    #!/usr/bin/env perl use Modern::Perl; my %data; # HoAoA holding final data structure my @indexes; # array of indexes into %data based on coord +inates my $header_line = <DATA>; # some of these steps could be combined; chomp $header_line; # including them for clarity my @headers = split ',', $header_line; shift @headers; # throw away SAMPLE for my $n (@headers){ my $x = $n/100_000; my $y = $n % 100_000; push @indexes, [ $x, $y ]; } while(<DATA>){ chomp; my( $id, @bits ) = split ','; for my $i (0..@bits-1){ $data{$id}[$indexes[$i][0]][$indexes[$i][1]] = $bits[$i]; } } say "Expecting 0 0: ", $data{HG00553}[162][87365]; say "Expecting 1 1: ", $data{HG00638}[162][87851]; __DATA__ SAMPLE,16287215,16287226,16287365,16287649,16287784,16287851,16287912 HG00553,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00554,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00637,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00638,0 0,0 0,0 0,0 0,0 0,1 1,0 0 HG00640,0 0,0 0,0 0,0 0,0 0,1 1,0 0

    Aaron B.
    Available for small or large Perl jobs; see my home node.

Re: Population of HoAoA based on file contents
by iangibson (Scribe) on May 14, 2012 at 19:29 UTC

    Okay, I think I've made some progress, but I'm still not quite there yet. Here's what I now have:

    #!/usr/bin/perl use warnings; use strict; use v5.14; use Getopt::Long; use Bio::PopGen::IO; use Bio::PopGen::Statistics; die "need two arguments (i.e. chr cont) at invocation" unless @ARGV == + 2; chomp( my $chr_num = shift ); chomp( my $cont = shift ); open my $out_file, ">", "chr${chr_num}_exome_snps_processed_${cont}_ST +ATS" or die "Can't open output file: $!\n"; open my $in_file, "<", "chr${chr_num}_exome_snps_processed_$cont" or die "Can't open input file: $!\n"; my %data; my @snp_bins; my @individuals; my @all_snps; while (<$in_file>) { chomp; if (/^SAMPLE/) { my ( $placeholder, @coords ) = split /,/; foreach my $coord (@coords) { push @snp_bins, int( $coord / 100_000 ); } } else { my ( $id, @snps ) = split /,/; push @individuals, $id; push @all_snps[$. - 2], join(',', @snps); } } foreach my $individual (@individuals) { foreach my $index ( 0 .. $#snp_bins ) { push( @{ $data{$individual}[ $snp_bins[$index] ] }, $all_snps[ +$index] ); } } close $in_file;

    But there's still (at least) a problem with the line

    push @all_snps[$. - 2], join(',', @snps);

    I hope I'm otherwise headed in the right direction..?

    In regard to what I will do with undefined bins: I will iterate through all the bins, and any that don't have a minimum number of elements simply won't be passed as data to the bioperl popgen stats methods, later on in the program.

      It seems to me that this would do what you are trying to accomplish.
      while (<$in_file>) { chomp; if (/^SAMPLE/) { my ( $placeholder, @coords ) = split /,/; foreach my $coord (@coords) { push @snp_bins, int( $coord / 100_000 ); } } else { my ( $id, @snps ) = split /,/; #need to check here that $#snps == $#snp_bins ? foreach my $index ( 0 .. $#snp_bins ) { push( @{ $data{$id}[ $snp_bins[$index] ] }, $snps[$index] ); } } }
      Unless you need them elsewhere, I don't see a reason in your code snippent to store @individuals and @all_snps. The "need to check" comment above is based on my attempt to understand your last statement above. If it's possible that @snps might not have an entry for each @snp_bins, then you'll want this check.
Re: Population of HoAoA based on file contents
by iangibson (Scribe) on May 22, 2012 at 19:02 UTC

    Thanks! I rewrote my code following your advice. I used Data::Printer to check it and I think I've finally got my data into the structure that I wanted.

    In the meantime, I've been trying to work out how to process my data from the HoAoA. What I want to do is pass the first interior array (as a block) for every individual to the module methods, then pass the second array in the same way, and so on for each of the array positions in turn. The goal is to treat each array position as a separate window for the purposes of analysis. Just to make it clear, here is a simplified version of my structure:

    my %hash; $hash{key1} = [ [0 0, 0 1, 0 0, 1 1, 0 0, 0 1,], # first block [0 0, 0 0, 0 0, 0 1, 0 1, 0 0,], # second block [0 1, 0 0, 0 0, 0 1, 0 0, 0 0,], # third block ]; $hash{key2} = [ [0 0, 0 0, 0 0, 0 1, 1 1, 0 0,], # first block [0 0, 0 1, 0 0, 0 0, 0 0, 0 0,], # second block [0 0, 1 1, 0 0, 1 1, 0 0, 1 0,], # third block ];

    In the above case, I'd want to pass to the module methods (shown in the code below) the arrays labeled '#first block' for both individuals first, then those labeled '#second block', and finally those labeled '#third block', so that I get separate statistics for each genomic region.

    I have been trying to do this on my actual data. So far, I have produced the following (broken) code:

    my $id; # ?? foreach my $window ( @{ $data{$id} } ) { # only pass windows with > min number of stats (i.e. >1 for now), an +d # of course we also ignore any 'undef' windows. next unless scalar @{ $data{$id} }[$window] >= 2; foreach my $individual ( keys(%data) ) { my $segsites = Bio::PopGen::Statistics->segregating_sites_count( \ +@{ $data{$id} }[$window] ); my $singletons = Bio::PopGen::Statistics->singleton_count( \@{ $da +ta{$id} }[$window] ); my $pi = Bio::PopGen::Statistics->pi( \@{ $data{$id} }[$window] ); my $theta = Bio::PopGen::Statistics->theta( \@{ $data{$id} }[$wind +ow] ); my $tajima_D = Bio::PopGen::Statistics->tajima_D( \@{ $data{$id} } +[$window] ); my $D_star = Bio::PopGen::Statistics->fu_and_li_D_star( \@{ $data{ +$id} }[$window] ); my $F_star = Bio::PopGen::Statistics->fu_and_li_F_star( \@{ $data{ +$id} }[$window] ); # output follows: say $out_file "Population: $cont\tChromosome: $chr_num"; say $out_file "Seg sites\tSingletons\tPi\tTheta\tTajima's D\tFu & +Li F*\tFu & Li D*"; # to account for invariant sites, divide stats by window size (i.e +. 100_000) my @stats = qw($segsites $singletons $pi $theta $tajima_D $F_star +$D_star); foreach my $stat (@stats) { print $out_file $stat / 100_000, "\t" } print $out_file "\n"; } }

    This code produces all manner of errors, and I don't really know if my entire approach is incorrect, or if it's just that I don't understand the Perl syntax for working with complex data structures. The various pages on perldoc haven't really helped much. I'd really appreciate some hints on how to proceed with this.

      I'm going to deal with just this section, I think once you understand/update what is happening here, everything else you have will work.
      my $id; # ?? foreach my $window ( @{ $data{$id} } ) { # only pass windows with > min number of stats (i.e. >1 for now), an +d # of course we also ignore any 'undef' windows. next unless scalar @{ $data{$id} }[$window] >= 2; foreach my $individual ( keys(%data) ) {
      First, I think that you understand that having an undefined $id as a hash key won't work. And to fix this, you want to just rearrange your for loops.
      foreach my $id ( keys %data ){ foreach my $window ( @{$data{$id}} ){

      Just this will get you much further. There is still a problem with the second loop though, at least in how you use $window in the subsequent code. You are attempting to use $window as an index, but it's not an index. Since you have experience using Data::Printer, I suggest going ahead and checking out what $window is.

      Now, the way to fix things with the fewest number of overall edits is to change the second for loop.

      foreach my $id ( keys %data ){ foreach my $window ( 0 .. $#{$data{$id}} ){ next unless scalar @{ $data{$id}[$window] } >= 2;
      Then in your method calls, just change those to align with the scalar test above:
      \@{ $data{$id}[$window] }
      I hope that helps.

        Thanks for your further help on this. I feel like I'm almost there, but not quite.. I made the changes you suggested, but I'm still getting errors starting at the line

        next unless scalar @{ $data{$id}[$window] } >= 2;

        The error is:

        Can't use an undefined value as an ARRAY reference

        Is this referring to $window? Regarding the way I was using it above, was $window the outer array that holds all the smaller arrays? Even if this is true, I don't see why I'm getting the error messages I'm getting.

Re: Population of HoAoA based on file contents
by iangibson (Scribe) on May 10, 2012 at 16:13 UTC

    Thanks for the replies. I obviously didn't describe what I'm trying to accomplish sufficiently well, so let me try again.

    These are actually huge 1000 Genomes files, which I want to do population genetics analysis on using bioperl Bio::PopGen modules. But rather than passing an entire chromosome to get the statistics from (which I have already done: it works but the results are not genetically relevant), I want to process 100 kilobase windows sequentially and get separate statistics for each window.

    So I am trying to split my data up accordingly. The particular file that I gave a sample of starts at coordinate (i.e. DNA base) number 16,287,215. Only variant bases are listed in the files; this is why there are large gaps between coordinates. I want to start my windows from '1' (as in general, coordinates will vary for different chromosomes), so in this case windows 0 to 161 will be empty, but this doesn't matter as I simply won't pass empty windows to bioperl for analysis. I was imagining that my data would look something like this:

    $data{HG00553}[ [], [], [], ...., [0 0,0 0,0 0,0 0,0 0,0 0,0 0,....], +[], [], etc.] $data{HG00554}[ [], [], [], ...., [0 0,0 0,0 0,0 0,0 0,0 0,0 0,....], +[], [], etc.]

    ..where the pairs of zeros here would be in array number 162 (still comma-delimited), and the following arrays would also be populated (with data not given in the tiny sample I show), but the earlier arrays would be empty in this specific case. It seems to me that a hash of arrays of arrays would be the best way to go about this, but please correct me if this is wrong.

    Then subsequently I can pass a particular window number for all individuals to the popgen module to get my stats for that genomic region, then do this for all windows and for each chromosome.

    I hope this clarifies what I'm trying to do, and further help would be great! Thanks again.

      Ok, I think I get it. Instead of $data{HG00553}[162] pointing to an array of two-bit elements as I did, you want it to contain a comma-delimited string of those elements? In that case, you can join them together after the array is built by inserting this line into my code:

      $data{HG00553}[162] = join ',', @{$data{HG00553}[162]};

      That still leaves the question of what you'd want in any undefined elements of the array, though.

      Aaron B.
      Available for small or large Perl jobs; see my home node.

      So you want (?):
      $data{HG00554}[162] = (0 0, 0 0, ... , 0 0 )
      I'm hoping you've come to understand why you don't want to use the shift in:
      $data{$individual}[ [ shift @snp_bins ] ] = $snp;
      Assuming so, you could do something like:
      for my $index ( 0 .. $#snp_bins ){ push( @{$data{HG00554}[$snp_bins[$index]]}, $snps[$index] ); }
      I hope this is helpful, I'm still not convinced I'm understanding what exactly you're needing.

        I think I'm on the right track now. I can see what was wrong with my original code, and I think that the structure you show above should be what I'm looking for.

        Let me work on it some more (and re-read more carefully the relevant Perl documentation), then if I still have problems I'll either add another comment below or start a fresh post (I won't have much time to work on it until next week, however). Thanks for all your help.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://969678]
Approved by ww
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others learning in the Monastery: (3)
As of 2024-04-19 17:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found