http://www.perlmonks.org?node_id=960990


in reply to Help the counting!

What is bp?

Without knowing exactly what you want, I can tell you that your script seems to have three loops nested inside each other, which will make its running time unbearable for non-trivial data. Your problem doesn't seem to need that.

How about this algorithm: Sort the ranges by ascending start position (you need a more complex data structure, such as an array of arrays, for this). Store the range of the first cluster in $xstart and $xend. Then, for each new cluster (outer loop), test whether it overlaps the current range; if it does, count the overlap (inner loop) and update $xend with the end of the new range.
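A minimal sketch of that idea, in case it helps. The sub name merge_ranges and the sample numbers are made up for illustration, and I am assuming that two ranges overlap only when the new start is strictly less than the current end; adjust the comparison if touching ranges should count too.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes a list of [start, end] pairs (an array of arrays)
# and returns the merged clusters as [start, end, count].
sub merge_ranges {
    my @ranges = sort { $a->[0] <=> $b->[0] } @_;   # ascending start position
    return () unless @ranges;

    my $first = shift @ranges;
    my ($xstart, $xend) = @$first;
    my $count = 1;
    my @clusters;

    for my $r (@ranges) {
        my ($start, $end) = @$r;
        if ($start < $xend) {
            # Assumed overlap rule: new start lies before the current end.
            $count++;
            $xend = $end if $end > $xend;   # extend, never shrink
        }
        else {
            # No overlap: close the current cluster and start a new one.
            push @clusters, [$xstart, $xend, $count];
            ($xstart, $xend, $count) = ($start, $end, 1);
        }
    }
    push @clusters, [$xstart, $xend, $count];
    return @clusters;
}

# Made-up sample ranges, deliberately unsorted.
my @sample = ([106, 111], [101, 105], [112, 113], [102, 108]);
print join(' ', @$_), "\n" for merge_ranges(@sample);
```

Because the ranges are sorted first, a single pass over them is enough, so the running time is dominated by the sort instead of by nested loops.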

Re^2: Help the counting!
by g-alone (Initiate) on Mar 22, 2012 at 12:39 UTC
    I guess the easiest way is to show you what my data looks like in a simpler form. bp = basepair. I have data for different chromosomes, with the start and end points of the cluster tags in each chromosome. It looks like this:

    columns = 1: chr 2: start 3: end 4: info(X) 5: info(X) 6: strand

    chr1 101 105 X X -
    chr1 102 108 X X -
    chr1 106 111 X X -
    chr1 112 113 X X -
    chr1 113 115 X X -
    chr2 114 118 X X -
    chr2 119 121 X X -
    chr2 120 123 X X -
    chr3 125 130 X X -
    chr3 131 132 X X -

    I need columns 1, 2, 3 and 6. I want to count the overlaps (with a 2-basepair overlap) for the coordinates in each chromosome and give one ID number to the coordinates that overlap together. For example, in chr1 there are 4 coordinates and 3 of them overlap, all except the last one, so I need to first count those 3 overlaps and give them ID_1; ID_2 then goes to the last coordinate in chr1, which does not overlap with the others. Then the count is reset to zero for chr2: check the chr2 coordinates for overlaps, give IDs starting from ID_1 again and count, and so on for all chromosomes. The output should look like this:

    TSSD_ID chr start end strand count
    ID_1 1 101 111 - 3
    ID_2 1 112 113 - 1
    ID_3 1 113 115 - 1
    ID_1 2 114 118 - 1
    ID_2 2 119 123 - 2
    ID_1 3 125 132 - 1

    This is the output I want to get, but mine is not working well!

      I think you made a mistake with chromosome 3 in your example output: 125-130 and 131-132 do not overlap, so they should be two separate clusters.

      Here is a working solution:

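      #!/usr/bin/perl
      use warnings;
      use strict;
      use Data::Dumper;

      # Read whitespace-separated columns: chr start end info info strand
      my @chrom;
      while (<>) {
          my ($chr, $start, $end, $c, $d, $strand) = split;
          push @chrom, { 'chrom' => $chr, 'start' => $start, 'end' => $end, 'strand' => $strand };
      }

      # Sort by chromosome name (string order) first, then by start,
      # so that the ranges of each chromosome stay together
      my @chroms = sort { $a->{chrom} cmp $b->{chrom} or $a->{start} <=> $b->{start} } @chrom;
      #print Dumper(\@chroms);

      exit(0) if (@chroms == 0);

      my $coord = shift @chroms;   # use first coordinate as range counter
      my $overlap = 1;
      my $id = 1;

      while (my $co = shift @chroms) {
          if ($co->{chrom} ne $coord->{chrom}) {
              # new chromosome: print the pending range and restart the IDs
              printresults($coord, $overlap, $id);
              $coord = $co;
              $overlap = 1;
              $id = 1;
          }
          else {
              if ($co->{start} >= $coord->{end}) {
                  # no overlap: print the pending range and start a new one
                  printresults($coord, $overlap, $id);
                  $coord = $co;
                  $overlap = 1;
                  $id++;
              }
              else {
                  # overlap: count it and extend the range (never shrink it)
                  $overlap++;
                  $coord->{end} = $co->{end} if $co->{end} > $coord->{end};
              }
          }
      }
      printresults($coord, $overlap, $id);

      #-------------
      sub printresults {
          my ($coord, $overlap, $id) = @_;
          print "ID_$id $coord->{chrom} $coord->{start} $coord->{end} $coord->{strand} $overlap\n";
      }
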
      prints
      ID_1 chr1 101 111 - 3
      ID_2 chr1 112 113 - 1
      ID_3 chr1 113 115 - 1
      ID_1 chr2 114 118 - 1
      ID_2 chr2 119 123 - 2
      ID_1 chr3 125 130 - 1
      ID_2 chr3 131 132 - 1

      You can remove the '#' in front of the 'print Dumper' line if you want to see what the data in @chroms looks like.