in reply to Faulty Control Structures?

I don't have a fix for you, but instead I'm offering some refactoring advice.

  1. Load what you need from the annotation file into a hash.
  2. Might as well do your processing one line at a time, no need to have the whole file in memory.

Here follows some code to implement these first thoughts. I have not changed your use of the global @window, @probe, which would be my next targets for refactoring. But I think after you refactor, you may find it easier to debug.

open my $annotation_read_handle, '<', $annotation_file; my %annotation_for; while (my $ad = <$annotation_read_handle> ) { # read $an_chrom, $prol, $pror out of $ad my ( $an_chrom, undef, undef, $prol, $pror, undef, undef, undef, $mess, ) = split(/\t/, $ad); # read $name out of $mess my (undef, undef, $name) = split(/\;/, $mess); # store for future lookups $annotation_for{$an_chrom} = [ $name, $prol, $pror ]; } # loop through the main data file OLC: while (my $md = <$main_read_handle> ) { # remove newlines chomp $md; # pull out chromosome #, window start, end my ($main_chrom, $winl, $winr) = split(/\t/, $md); # see if $main_chrom has been annotated next OLC if !exists $annotation_for{$main_chrom}; my $array_ref = $annotation_for{$main_chrom}; ( my $name, @probe ) = @$array_ref; # put the window start, end into array for further processing @window = ($winl, $winr); # call the range_finding sub to look for matches my $return = range_find(); next OLC if !$return; # upon matching, print out the name of the gene along with the ori +ginal values print OUTPUT "$name\t $md\n"; }

Replies are listed 'Best First'.
Re^2: Faulty Control Structures?
by bioinformatics (Friar) on Jan 28, 2008 at 23:22 UTC
    Thanks for your input. Unfortunately, this won't work as there isn't a good selection of unique identifiers to use as Keys for the Hash. So, using the code you provide, I'd end up with 23 key-value combinations, when I need 330k :). A hash of arrays would work better, in that I could have the values appended to the arrays for each chromosome, but then getting the data out would be a bit of a nightmare. I will look to cleaning up the globals though, as I was being a bit lazy there :).
    EDIT: Actually, 24 combinations, as there are both x and y to consider :).
    Bioinformatics

      You're right. I overlooked the statement label in one of your next statements. I could not have arrived at my misreading if I had been as aware as you are that there are only two dozen chromosomes.

      But what about your hash of arrays? Why would getting the data out be such a nightmare?

      Populating the hash of arrays:

      open my $annotation_read_handle, '<', $annotation_file; my %annotations_for; while (my $ad = <$annotation_read_handle> ) { # read $an_chrom out of $ad my ($an_chrom, undef) = split(/\t/, $ad); # store for future lookups push @$annotations_for{$an_chrom}, $ad; } close $annotation_read_handle;

      Now read through the main data file and assign each chromosome number to my $main_chrom.

      # look up the list of annotations relevant to the current chromoso +me my $annotations_ref = $annotations_for{$main_chrom}; # loop through just these annotations ILC: foreach my $ad (@$annotations_ref) { # ... }

      Of course—as other more enlightened commentators have already pointed out—the most important thing to optimize is the range_find subroutine.