http://www.perlmonks.org?node_id=664855


in reply to Faulty Control Structures?

A few ways to optimise this:

  1. You mentioned somewhere that hashes are not useful as there are only 24 keys. You also mentioned about using a hash of arrays, but dismissed it.

    That's premature. If you load your main file into a hash of arrays, then you only need compare each of your annotated records against 1/24th of the main records. And that reduces your overall workload and runtime to 1/24th of what it would be.

    my %main; my $mfile = prompt( "Main file?" ); open MAIN, '<', $mfile or die "$mfile: $!"; m[^([^\t]+)\t] and push @{ $main{ $1 } }, $_ or die "Bad record in $mfile @ $." while <MAIN>; close MAIN; chomp @$_ for values %main;
  2. There is no reason to load the second file into ram. You can process it one record at a time:
    my $afile = prompt( "Annotation file?" ); open ANNOT, '<', $afile or die "$afile: $!"; open OUTPUT, '>', 'output.txt' or die $!;

    Now your main loops become:

    OUTER: while( <ANNOT> ) { chomp; my( $atag, $astart, $aend, $mess ) = split "\t"; next unless exists $main{ $atag }; for my $md ( @{ $main{ $atag } } ) { my( $mtag, $mstart, $mend ) = split "\t", $md; if( ?OVERLAPPED? ) { print OUTPUT "$atag\t$mess"; } } }
  3. As others have pointed out, your range checking is far too complex and computationally expensive. Although there are many (7?) permutations of relative positioning of the starts and ends of the ranges, there are only 2 that do not overlap:
    1. Main end less than Annot start:
      Ms------Me As------Ae
    2. Annot end less than Main start:
      Ms------Me As------Ae

    Any other position is overlapped.

    If both your files are already in sorted order (if they are not, (learn to) use your system sort utility and sort them ascending by the tag, start, end), then when Main Start moves greater than Annot End, then you need process no further main records for this tag, so we can skip to the next Annot record. (Outer loop)

    Equally, whilst Main End is less than Annot Start, you can skip to the next Main record. (Inner loop).

    And actually, that covers both (all) conditions when we do not want to print output, so no further range testing is required.

    That makes the loops look like this:

    OUTER: while( <ANNOT> ) { chomp; my( $atag, $astart, $aend, $mess ) = split "\t"; next unless exists $main{ $atag }; for my $md ( @{ $main{ $atag } } ) { my( $mtag, $mstart, $mend ) = split "\t", $md; next if $mend < $astart; next OUTER if $mstart > $aend; print OUTPUT "$atag\t$mess"; } }
  4. Finally, there is one further optimisation possible. As both files are sorted, when we start processing the second Annot record for a given tag, then we know it cannot be earlier than the last Annot record for this tag.

    Therefore, if we rejected and skipped the first N Main records for this tag for the last Annot record, we know we can skip at least that many for this record. This complicates the code a little, but yields a huge saving in processing which make it worth while.

The final cut of the loops code looks like this:

my( $first, $lastTag ) = ( 0, '' ); OUTER: while( <ANNOT> ) { chomp; ## ((simplified for testing) my( $atag, $astart, $aend, $mess ) = split "\t"; ## If the tag changed, reset the start position for the Main recor +ds ## and remember the new tag $first = 0, $lastTag = $atag if $atag ne $lastTag; ## Skip completely if this Annot tag has no corresponding Main tag next unless exists $main{ $atag }; ## For each Main record with this tag, ## But skipping those we rejected last time for my $md ( @{ $main{ $atag } }[ $first .. $#{ $main{ $atag } } ] + ) { ## (simplified for testing) my( $mtag, $mstart, $mend ) = split "\t", $md; ## Increment the skip and goto the next Main record ## If the end of this Main record is less than ## the start of the Annot record ++$first, next if $mend < $astart; ## Skip the rest of these Main record ## if the start of this Main record is greater than ## the end of this Annot record next OUTER if $mstart > $aend; ## We have an overlap, so output the info print OUTPUT "$atag\t$mess"; } }

The result is that it now takes less that six minutes to process 300,000 record against 300,000 records, finding 113,000 overlaps:

C:\test>664760-gen > 664760.main C:\test>wc -l 664760.main 300000 664760.main C:\test>664760-gen > 664760.annot C:\test>wc -l 664760.annot 300000 664760.annot C:\test>664760-b Main file?664760.main Annotation file?664760.annot Tue Jan 29 06:36:34 2008 Tue Jan 29 06:42:12 2008 C:\test>wc -l output.txt 112805 output.txt

My best guess is this 3 if not 4 orders of magnitude quicker. Let's see anyone come close to that using an RDBMS.

Full test code:

#! perl -slw use strict; my %main; ## Get the filename for and open the main file my $mfile = prompt( "Main file?" ); open MAIN, '<', $mfile or die "$mfile: $!"; ## Build a HoAs keys by the tag (chromosome?) ## Don't split the records yet to save ram. m[^([^\t]+)\t] and push @{ $main{ $1 } }, $_ or die "Bad record in $mfile @ $." while <MAIN>; close MAIN; chomp @$_ for values %main; ## Chomp the arrays ## Get the filename for and open the annotations file my $afile = prompt( "Annotation file?" ); open ANNOT, '<', $afile or die "$afile: $!"; ## Open the output file open OUTPUT, '>', 'output.txt' or die $!; ## Record our start time print scalar localtime; ## Variables to remember the llast Annot tag we processed ## And how many Main records we skipped before we found an overlap my( $first, $lastTag ) = ( 0, '' ); OUTER: while( <ANNOT> ) { chomp; ## Read and split the Annot record my( $atag, $astart, $aend, $mess ) = split "\t"; ## If the tag changed, reset the start porsition for the Main reco +rds ## and remeber the new tag $first = 0, $lastTag = $atag if $atag ne $lastTag; ## Skip completely if this Annot tag has no corresponding Main tag next unless exists $main{ $atag }; ## For each Main record with this tag, ## But skipping those we rejected last time for my $md ( @{ $main{ $atag } }[ $first .. $#{ $main{ $atag } } ] + ) { ## Split the Main record. We do it over and over, but it's not + hugely expensive. ## Split when building the HoAs if you have enough memory. my( $mtag, $mstart, $mend ) = split "\t", $md; ## Increment the skip and goto the next Main record ## If the end of this Main record is less than ## the start of the Annot record ++$first, next if $mend < $astart; ## Skip the rest of these Main record ## if the start of this Main record is greater than ## the end of this Annot record next OUTER if $mstart > $aend; ## We have an overlap, so output the info print OUTPUT "$atag\t$mess"; } } close ANNOT; close OUTPUT; print scalar localtime; sub prompt { printf "%s", $_[ 0 ]; chomp( my $in = <STDIN> ); return $in; } __END__ C:\test>664760-gen > 664760.main C:\test>wc -l 664760.main 300000 664760.main C:\test>664760-gen > 664760.annot C:\test>wc -l 664760.annot 300000 664760.annot C:\test>664760-b Main file?664760.main Annotation file?664760.annot Tue Jan 29 06:36:34 2008 Tue Jan 29 06:42:12 2008 C:\test>wc -l output.txt 112805 output.txt

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.