Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Re: What is the best approach to check if a position falls within a target range?

by umasuresh (Hermit)
on Nov 15, 2010 at 22:16 UTC ( [id://871593]=note: print w/replies, xml ) Need Help??


in reply to What is the best approach to check if a position falls within a target range?

Hi Monks, I am posting an updated version of the code which significantly reduced the time for assigning target status (less than 1/2 the time compared to the original version).
use strict; use Data::Dumper; use List::Util qw/max/; # target region my @a = ("100_200","210_310","400_450","475_600", "680_900"); my $query_snp = 205; my $line = quicksort(\@a, \$query_snp); print "$$line"; # loosely based on the quicksort algorithm (Mastering Alorithms with P +erl) sub quicksort { my $array = shift; my $snp = shift; my $start_idx = 0; my $end_idx = scalar @$array - 1; my $mid_point = int( ( $end_idx - $start_idx)/2 ); my $pivot = $start_idx + $mid_point; my ($current_start , $current_end) = split /\_/, $array->[$pivot]; my $current_start_idx = $$snp < $current_start ? $start_idx: $pivo +t; my $current_end_idx = $current_start_idx ==$start_idx ? $pivot: +$end_idx; my $new_array = [ @$array[$current_start_idx..$current_end_idx] ]; my $out_line; # check if the query_snp falls in the pivot region if($end_idx > 1 && $$snp >= $current_start && $$snp <= $current_en +d) { $out_line = "$$snp\t$current_start\t$current_end"; return(\$out_line); } # If there are only two elements in the input array, three cases a +re possible # case1: (100_200, 280_380) # a. query_snp < 100 # b. query_snp > 380 # c. query_snp > 200 && query_snp < 280 elsif($end_idx ==1 ) { if( $$snp > ( split /\_/, $array->[0] )[1] && $$snp < ( split +/\_/, $array->[1] )[0] || $$snp > ( split /\_/, $array->[1] )[1] || $$snp < ( split /\_/, $array->[0] )[0] ) { $out_line = "$$snp\tNOT_IN_TARGET"; return(\$out_line); } # case 2: query _snp lies in one of the target regions # a. query_snp >=100 && <=200 # b. query_snp >=280 && <=380 elsif( $$snp >= ( split /\_/, $array->[0] )[0] && $$snp <= ( s +plit /\_/, $array->[0] )[1] || $$snp >= ( split /\_/, $array->[1] )[0] && $$snp <= ( s +plit /\_/, $array->[1] )[1] ) { $out_line = "$$snp\tyes"; return(\$out_line); } # case3: Doesn't satisfy any of the above and hence is an erro +r else { $out_line = "$$snp\terror\n"; return(\$out_line); } } # if it doesn't lie in the pivot region, do a recursive check of h +alf the array elements each time it enters this loop # this significantly reduces the search time else { quicksort($new_array, $snp); }
Any input is greatly appreciated!
  • Comment on Re: What is the best approach to check if a position falls within a target range?
  • Download Code

Replies are listed 'Best First'.
Re^2: What is the best approach to check if a position falls within a target range?
by umasuresh (Hermit) on Feb 12, 2011 at 13:58 UTC
    Hi Monks,
    I would very much appreciate any suggestions on optimizing this algorithm. Typically, the query file contains some 2 million records and the target region has about 200,000 records. In these cases, it takes several hours to assign the target status even with the quicksort algorithm. I am thinking of first dividing the target array into pivot regions of 12.5, 25, 50 75, 87.5 and 100, then checking to see if a query snp lies in a bigger chunk before passing the subset array into the quicksort function. It appears that the recursion in the function is a big drag!
    Is there a quicker way of achieving this?
    Thanks Much,
    Uma
      the query file contains some 2 million records and the target region has about 200,000 records.

      Could you post 3 query file records and 3 target region records?

        Hi BrowserUk,
        query ===== 100 200 300 target_region ============= 10_50 60_150 180_250 expected output =============== 100 60_150 200 180_250
        Thanks Much, Uma

      FWIW. I've a solution that matches 2e6 random integers (0 .. 1000) against 200,000 randomly generated ranges (0..700, 1..300) in 45 minutes using 3GB of ram.

      Of course, of those 2e6 queries only 1000 are unique so it's doing 2000 more work than it needs to.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Hi BrowserUk,
        Thanks much for all your replies. I did some major refactoring of the code these past few days and was able to achieve significant improvement in speed.
        Major changes are:
        1. split the target region into 24 chunks for 24 chromosomes (chr) and only loaded the chr of interest in memory.
        2. Converted the target hash to a target array. This caused a big gain in speed.
        UPDATE
        3.Divided target region into 8 chunks 10-12.5-25-50 and so on. Checked the if the query snp is in which chunk before assigning target status. Uma

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://871593]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others rifling through the Monastery: (6)
As of 2024-04-23 23:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found