http://www.perlmonks.org?node_id=998965


in reply to Find overlap

Update: sorry, but this doesn't work. It misses many overlapping sets.

If I understand your objectives correctly, something like the following untested code may help. It assumes the input data is sorted by the first field. The method of detecting overlaps is quite different from what you posted. I don't know if it gives the same results. It has the advantage that it does a single pass over each set of input records.

use strict; use warnings; open(my $N148,'<',"148Nsorted.bed") or die $!; open(my $N162,'<',"162Nsorted.bed") or die $!; open(my $N174,'<',"174Nsorted.bed") or die $!; open(my $N175,'<',"175Nsorted.bed") or die $!; my ($chr148, $start148, $end148) = scan($N148); my ($chr162, $start162, $end162) = scan($N162); my ($chr174, $start174, $end174) = scan($N174); my ($chr175, $start175, $end175) = scan($N175); my ($minstart, $maxstart, $minend, $maxend); while(defined($chr148)) { $minstart = $start148; $maxstart = $start148; $minend = $end148; $maxend = $end148; while(defined($chr162) and $chr162 lt $chr148) { ($chr162, $start162, $end162) = scan($N162); } while(defined($chr162) and $chr162 eq $chr148) { $minstart = $start162 if($start162 < $minstart); $maxstart = $start162 if($start162 > $maxstart); $minend = $end162 if($end162 < $minend); $maxend = $end162 if($end162 > $maxend); if($maxstart <= $minend) { while(defined($chr174) and $chr174 lt $chr148) { ($chr174, $start174, $end174) = scan($N174); } while(defined($chr174) and $chr174 eq $chr148) { $minstart = $start174 if($start174 < $minstart); $maxstart = $start174 if($start174 > $maxstart); $minend = $end174 if($end174 < $minend); $maxend = $end174 if($end174 > $maxend); if($maxstart <= $minend) { while(defined($chr175) and $chr175 lt $chr148) { ($chr175, $start175, $end175) = scan($N175); } while(defined($chr175) and $chr175 eq $chr148) { $minstart = $start175 if($start175 < $minstart +); $maxstart = $start175 if($start175 > $maxstart +); $minend = $end175 if($end175 < $minend); $maxend = $end175 if($end175 > $maxend); if($maxstart <= $minend) { print "$chr148 $minstart $maxend\n"; } ($chr175, $start175, $end175) = scan($N175); } } ($chr174, $start174, $end174) = scan($N174); } } ($chr162, $start162, $end162) = scan($N162); } ($chr148, $start148, $end148) = scan($N148); } close($N148); close($N162); close($N174); close($N175); exit(0); sub scan { my ($fh) = @_; my $line = <$fh>; return() unless($line); return(split(/\t/, $line)); }

update: the following might do what you want. It will take more memory and time than the above but it will find all overalpping sets.

use strict; use warnings; use List::Util qw(min max); my $ranges148 = scan('148Nsorted.bed'); my $ranges162 = scan('162Nsorted.bed'); my $ranges174 = scan('174Nsorted.bed'); my $ranges175 = scan('175Nsorted.bed'); foreach my $chr (keys %$ranges148) { foreach my $range148 (@{$ranges148->{$chr}}) { foreach my $range162 (@{$ranges162->{$chr}}) { foreach my $range174 (@{$ranges174->{$chr}}) { foreach my $range175 (@{$ranges175->{$chr}}) { my $minstart = min( $range148->[0], $range162->[0], $range174->[0], $range175->[0]); my $maxstart = max( $range148->[0], $range162->[0], $range174->[0], $range175->[0]); my $minend = min( $range148->[1], $range162->[1], $range174->[1], $range175->[1]); my $maxend = max( $range148->[1], $range162->[1], $range174->[1], $range175->[1]); if($maxstart <= $minend) { print "overlapping $chr: $minstart, $maxend - +($range148->[0], $range148->[1]), ($range162->[0], $range162->[1]), ( +$range174->[0], $range174->[1]), ($range175->[0], $range175->[1])\n"; } else { print "not overlapping: $chr: $minstart, $maxe +nd - ($range148->[0], $range148->[1]), ($range162->[0], $range162->[1 +]), ($range174->[0], $range174->[1]), ($range175->[0], $range175->[1] +)\n"; } } } } } } exit(0); sub scan { my ($file) = @_; open(my $fh, '<', $file) or die "$file: $!"; my $ranges; foreach my $line (<$fh>) { chomp($line); my ($chr, $start, $end) = split(/\t/, $line); push(@{$ranges->{$chr}}, [ 0+$start, 0+$end ]); } return($ranges); }

Replies are listed 'Best First'.
Re^2: Find overlap
by linseyr (Acolyte) on Oct 14, 2012 at 20:21 UTC
    Hi, Thanks for your answer. I dont really get the final part. Why do you substract the ranges from maxend? Now it just prints: not overlapping: chr7: 41396, 22342744 - (41396, 41743), (41610, 41757), (884716, 884900), (22342598, 22342744). Not a number. Could you explain this part? Thanks!
      Oh im sorry i get it. you just print the ranges out the files.!