Find overlap

linseyr has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I want to find the overlapping regions from 4 files which look like this:

file: 148N
chr1    10    50

file: 162N
chr1    9       40

file: 174N
chr1    12      60

file: 175N
chr1    30      45
[download]

If the fist column (chr) is the same in ALL files, i want to compare column 3 and 4 (start and end). If in ALL files the regions are overlapping, I want to return the smallest start position and the largest end position to get the complete region in which there are overlapping reads. So for this input the output would be:

chr1    9    60
[download]

Because there is a region that is overlapping in all files. I want to do this, ONLY if ALL 4 files are overlapping. I wrote a script now to check for all possible combination for overlapping reads, but because of all the many possible combinations I think it wont be finished in this week... This is my script now:


open(N148,'<',"148Nsorted.bed") or die $!;
my @Sample148N = <N148>;
close(N148);

open(N162,'<',"162Nsorted.bed") or die $!;
my @Sample162N = <N162>;
close(N162);

open(N174,'<',"174Nsorted.bed") or die $!;
my @Sample174N = <N174>;
close(N174);

open(N175,'<',"175Nsorted.bed") or die $!;
my @Sample175N = <N175>;
close(N175);

my @results;
my $start;
my $end;
my $overlap = "no-overlap";
my $option1 = ($end148-$start162 >= 1 && $end162 > $end148) && ($end17
+4 - $start148 > 1 && $start174 < $start148) && ($start175 > $start148
+ && $end148 > $end175);
my $option2 = ($end148-$start162 >= 1 && $end162 > $end148) && ($end17
+5 - $start148 > 1 && $start175 < $start148) && ($start174 > $start148
+ && $end148 > $end174);
my $option3 = ($end148-$start174 >= 1 && $end174 > $end148) && ($end17
+5 - $start148 > 1 && $start175 < $start148) && ($start162 > $start148
+ && $end148 > $end162);
my $option4 = ($end148-$start175 >= 1 && $end175 > $end148) && ($end17
+4 - $start148 > 1 && $start174 < $start148) && ($start162 > $start148
+ && $end148 > $end162);
my $option5 = ($end148-$start174 >= 1 && $end174 > $end148) && ($end16
+2 - $start148 > 1 && $start162 < $start148) && ($start175 > $start148
+ && $end148 > $end175);
my $option6 = ($end148-$start175 >= 1 && $end175 > $end148) && ($end16
+2 - $start148 > 1 && $start162 < $start148) && ($start174 > $start148
+ && $end148 > $end174);

my $option7 = ($end162-$start148 >= 1 && $end148 > $end162) && ($end17
+4 - $start162 > 1 && $start174 < $start162) && ($start175 > $start162
+ && $end162 > $end175);
my $option8 = ($end162-$start148 >= 1 && $end148 > $end162) && ($end17
+5 - $start162 > 1 && $start175 < $start162) && ($start174 > $start162
+ && $end162 > $end174);
my $option9 = ($end162-$start174 >= 1 && $end174 > $end162) && ($end17
+5 - $start162 > 1 && $start175 < $start162) && ($start148 > $start162
+ && $end162 > $end148);
my $option10 = ($end162-$start175 >= 1 && $end175 > $end162) && ($end1
+74 - $start162 > 1 && $start174 < $start162) && ($start148 > $start16
+2 && $end162 > $end148);
my $option11 = ($end162-$start174 >= 1 && $end174 > $end162) && ($end1
+48 - $start162 > 1 && $start148 < $start162) && ($start175 > $start16
+2 && $end162 > $end175);
my $option12 = ($end162-$start175 >= 1 && $end175 > $end162) && ($end1
+48 - $start162 > 1 && $start148 < $start162) && ($start174 > $start16
+2 && $end162 > $end174);

my $option13 = ($end174-$start162 >= 1 && $end162 > $end174) && ($end1
+48 - $start174 > 1 && $start148 < $start174) && ($start175 > $start17
+4 && $end174 > $end175);
my $option14 = ($end174-$start162 >= 1 && $end162 > $end174) && ($end1
+75 - $start174 > 1 && $start175 < $start174) && ($start148 > $start17
+4 && $end174 > $end148);
my $option15 = ($end174-$start148 >= 1 && $end148 > $end174) && ($end1
+75 - $start174 > 1 && $start175 < $start174) && ($start162 > $start17
+4 && $end174 > $end162);
my $option16 = ($end174-$start175 >= 1 && $end175 > $end174) && ($end1
+48 - $start174 > 1 && $start148 < $start174) && ($start162 > $start17
+4 && $end174 > $end162);
my $option17 = ($end174-$start148 >= 1 && $end148 > $end174) && ($end1
+62 - $start174 > 1 && $start162 < $start174) && ($start175 > $start17
+4 && $end174 > $end175);
my $option18 = ($end174-$start175 >= 1 && $end175 > $end174) && ($end1
+62 - $start174 > 1 && $start162 < $start174) && ($start148 > $start17
+4 && $end174 > $end148);

my $option19 = ($end175-$start162 >= 1 && $end162 > $end175) && ($end1
+74 - $start175 > 1 && $start174 < $start175) && ($start148 > $start17
+5 && $end175 > $end148);
my $option20 = ($end175-$start162 >= 1 && $end162 > $end175) && ($end1
+48 - $start175 > 1 && $start148 < $start175) && ($start174 > $start17
+5 && $end175 > $end174);
my $option21 = ($end175-$start174 >= 1 && $end174 > $end175) && ($end1
+48 - $start175 > 1 && $start148 < $start175) && ($start162 > $start17
+5 && $end175 > $end162);
my $option22 = ($end175-$start148 >= 1 && $end148 > $end175) && ($end1
+74 - $start175 > 1 && $start174 < $start175) && ($start162 > $start17
+5 && $end175 > $end162);
my $option23 = ($end175-$start174 >= 1 && $end174 > $end175) && ($end1
+62 - $start175 > 1 && $start162 < $start175) && ($start148 > $start17
+5 && $end175 > $end148);
my $option24 = ($end175-$start148 >= 1 && $end148 > $end175) && ($end1
+62 - $start175 > 1 && $start162 < $start175) && ($start174 > $start17
+5 && $end175 > $end174);

for my $line(@Sample148N){

    my($chr148,$start148,$end148) = split("\t",$line);
   
    for my $line2(@Sample162N){
        my($chr162,$start162,$end162) = split("\t",$line2);

        
        for my $line3(@Sample174N){
            my($chr174,$start174,$end174) = split("\t",$line3);
            
            for my $line4(@Sample175N){
                my($chr175,$start175,$end175) = split("\t",$line2);
                
                if(($chr148 eq $chr162 && $chr148 eq $chr174 && $chr14
+8 eq $chr175) && ($option1 || $option2 || $option3 || $option4 || $op
+tion5 || $option6 || $option7 || $option8 || $option9 || $option10 ||
+ $option11 || $option12 || $option13 || $option14 || $option15 || $op
+tion16 || $option17 || $option18 || $option19 || $option20 || $option
+21 || $option22 || $option23 || $option24){
                    $overlap = "overlap";
                    
                    $start = $start148 if ($start148 < $start162 && $s
+tart148 <$start174 && $start148 < $start175);
                    $end = $end148 if ($end148 > $end162 && $end148 > 
+$end174 && $end148 > $end175);
                    $start = $start162 if ($start162 < $start148 && $s
+tart162 <$start174 && $start162 < $start175);
                    $end = $end162 if ($end162 > $end148 && $end162 > 
+$end174 && $end162 > $end175);
                    $start = $start174 if ($start174 < $start162 && $s
+tart174 <$start148 && $start174 < $start175);
                    $end = $end174 if ($end174 > $end162 && $end174 > 
+$end148 && $end174 > $end175);
                    $start = $start175 if ($start175 < $start148 && $s
+tart175 <$start174 && $start175 < $start162);
                    $end = $end175 if ($end175 > $end148 && $end175 > 
+$end174 && $end175 > $end162);
            
        }
    }
   print "$chr148   $start  $end";
}
[download]

Could somebody help me find an easier way to do this?\ Thanks a lot!

Comment on Find overlap Select or Download Code

Replies are listed 'Best First'.
Re: Find overlap by frozenwithjoy (Priest) on Oct 14, 2012 at 00:19 UTC
Hi, a few questions to start off: You say there are 4 columns/line, but I only see 3, or do you mean 'chr' is in column 1 and the chromosome # is in column 2? Related to this, are there multiple chromosomes represented/file or are all regions from chr1? Approximately how many ranges are in each file? Finally, are the columns tab-delimited or?	[reply]
Re^2: Find overlap by linseyr (Acolyte) on Oct 14, 2012 at 01:06 UTC
Hi, No there are three columns per line: chr, start and end. There are 4 different files which contain these three columns. In total there are 24 chromosomes, and every files contains a different number of lines per chromosome. Every file has aprox 240,000 lines. so about 10,000 per chromosome.	[reply]
Re^3: Find overlap by Anonymous Monk on Oct 14, 2012 at 02:47 UTC
hi, i'm not sure if i get the problem right but this script will find the min of all values in column 2 of all given files and the max of all values in column 3. you just call it with the file names as arguments. `#!/usr/bin/perl -w use strict; my $min = undef; my $max = undef; foreach (@ARGV) { open(F, "< $_"); while(my $line = <F>) { chop($line); $line =~ s/[\ \t]+/:/g; my @p = split(/:/,$line); if(!defined($min)) { $min = $p[1]; } if(!defined($max)) { $max = $p[2]; } if($p[1] < $min) { $min = $p[1]; } if($p[2] > $max) { $max = $p[2]; } } close(F); } print "$min $max\n";` [download]	[reply] [d/l]
Re^3: Find overlap by Anonymous Monk on Oct 14, 2012 at 03:37 UTC
but if you need the first column as well, and if the columns are tab separated, this script would give you the minima of chr1..chrN of column2 of all given files and the maxima of chr1..chrN of column3 of all given files. do you need the file names as well? and i hope i understood the problem right #!/usr/bin/perl -w use strict; my %minmax; foreach (@ARGV) { open(F, "< $_"); while(my $line = <F>) { chop($line); my @p = split(/\t/,$line); if(!exists( $minmax{ $p[0] }{ min } )) { $minmax{ $p[0] }{ min } = $p[1]; } if($p[1] < $minmax{ $p[0] }{ min }) { $minmax{ $p[0] }{ min } = $p[1]; } if(!exists( $minmax{ $p[0] } { max } )) { $minmax{ $p[0] }{ max } = $p[2]; } if($p[2] > $minmax{ $p[0] }{ max }) { $minmax{ $p[0] }{ max } = $p[2]; } } close(F); } foreach (keys %minmax) { print "$_: min: $minmax{$_}{min} max: $minmax{$_}{max}\n"; } [download] for the given input: file1: `chr1 4 60 chr2 2 40 chr3 4 90 chr1 5 40` [download] file2: `chr2 1 30 chr1 6 20 chr4 9 100` [download] file3: `chr1 2 20 chr2 2 90 chr1 6 20 chr4 4 30` [download] file4: `chr2 4 90 chr3 3 90 chr2 4 90 chr4 3 90 chr2 4 30` [download] it would output: `chr1: min: 2 max: 60 chr2: min: 1 max: 90 chr3: min: 3 max: 90 chr4: min: 3 max: 100` [download]	[reply] [d/l] [select]
Re^4: Find overlap by Kenosis (Priest) on Oct 14, 2012 at 07:53 UTC
Re^4: Find overlap by Anonymous Monk on Oct 14, 2012 at 16:28 UTC
Re^5: Find overlap by Cristoforo (Curate) on Oct 17, 2012 at 15:42 UTC
Re: Find overlap by jwkrahn (Abbot) on Oct 14, 2012 at 10:01 UTC
Here is one way to do it: #!/usr/bin/perl use warnings; use strict; @ARGV = ( '148Nsorted.bed', '162Nsorted.bed', '174Nsorted.bed', '175Ns +orted.bed' ); my %data; while ( <> ) { /^\s(\S+)\s+(\d+)\s+(\d+)\s$/ or next; $data{ $1 } \|= '0' x ( $2 - 1 ) . '1' x ( $3 - ( $2 - 1 ) ); } keys( %data ) == 1 or die "Error: too many keys.\n"; my ( $name, $string ) = each %data; $string =~ /10+1/ and die "Error: no overlap.\n"; $string =~ /^01/ and my $start = $+[ 0 ]; $string =~ /.1/ and my $end = $+[ 0 ]; print "$name\t$start\t$end\n"; [download]	[reply] [d/l]
Re^2: Find overlap by Anonymous Monk on Oct 14, 2012 at 16:32 UTC
Hi, Thanks but this doesn't work for these files. I get an error "Too many keys"	[reply]
Re^2: Find overlap by linseyr (Acolyte) on Oct 14, 2012 at 16:46 UTC
Sorry but im pretty new with perl so I dont really know what this code does. Could you give me some explanation? And why do I get the error: Too many keys when I try to run it? Thanks.	[reply]
Re^3: Find overlap by jwkrahn (Abbot) on Oct 16, 2012 at 04:51 UTC
The code uses the bit-wise OR operator (\|) to turn all bytes in the string in the range to the character '1'. And why do I get the error: Too many keys when I try to run it? The keys of %data represent the first column of the data files ('chr1') so if you get that error message it means that there was something other than 'chr1' in one of the files. After thinking about the problem, and rereading it, it seems that ALL files must overlap so this may work better: #!/usr/bin/perl use warnings; use strict; @ARGV = ( '148Nsorted.bed', '162Nsorted.bed', '174Nsorted.bed', '175Ns +orted.bed' ); my ( $bit_mask, %data ) = 1; while ( <> ) { /^\s(\S+)\s+(\d+)\s+(\d+)\s$/ or next; $data \|= chr( 0 ) x ( $2 - 1 ) . chr( $bit_mask ) x ( $3 - ( $2 - +1 ) ); $bit_mask <<= 1; } keys( %data ) == 1 or die "Error: too many keys: @{[ keys %data ]}\n"; my ( $name, $string ) = each %data; $string =~ /\x0f/ or die "Error: no overlap on all files."; $string =~ /[^\0]/ and my ( $start, $end ) = ( $+[ 0 ], length $string + ); print "$name\t$start\t$end\n"; [download]	[reply] [d/l]
Re: Find overlap by remiah (Hermit) on Oct 14, 2012 at 02:05 UTC
Hello lunseyr. I guess maybe something like this ...Input <DATA> may be diffrent from your files. This example prints overlaps:9,200 no overlaps for chr2 I just thought packages may be good for adding constraints to your items. Here I used Moose, but I am feeling something not good. I hope someone points me for better design. Read more... (2 kB) regards.	[reply] [d/l]
Re: Find overlap by ig (Vicar) on Oct 14, 2012 at 19:17 UTC
Update: sorry, but this doesn't work. It misses many overlapping sets. Read more... (3 kB) update: the following might do what you want. It will take more memory and time than the above but it will find all overalpping sets. use strict; use warnings; use List::Util qw(min max); my $ranges148 = scan('148Nsorted.bed'); my $ranges162 = scan('162Nsorted.bed'); my $ranges174 = scan('174Nsorted.bed'); my $ranges175 = scan('175Nsorted.bed'); foreach my $chr (keys %$ranges148) { foreach my $range148 (@{$ranges148->{$chr}}) { foreach my $range162 (@{$ranges162->{$chr}}) { foreach my $range174 (@{$ranges174->{$chr}}) { foreach my $range175 (@{$ranges175->{$chr}}) { my $minstart = min( $range148->[0], $range162->[0], $range174->[0], $range175->[0]); my $maxstart = max( $range148->[0], $range162->[0], $range174->[0], $range175->[0]); my $minend = min( $range148->[1], $range162->[1], $range174->[1], $range175->[1]); my $maxend = max( $range148->[1], $range162->[1], $range174->[1], $range175->[1]); if($maxstart <= $minend) { print "overlapping $chr: $minstart, $maxend - +($range148->[0], $range148->[1]), ($range162->[0], $range162->[1]), ( +$range174->[0], $range174->[1]), ($range175->[0], $range175->[1])\n"; } else { print "not overlapping: $chr: $minstart, $maxe +nd - ($range148->[0], $range148->[1]), ($range162->[0], $range162->[1 +]), ($range174->[0], $range174->[1]), ($range175->[0], $range175->[1] +)\n"; } } } } } } exit(0); sub scan { my ($file) = @_; open(my $fh, '<', $file) or die "$file: $!"; my $ranges; foreach my $line (<$fh>) { chomp($line); my ($chr, $start, $end) = split(/\t/, $line); push(@{$ranges->{$chr}}, [ 0+$start, 0+$end ]); } return($ranges); } [download]	[reply] [d/l] [select]
Re^2: Find overlap by linseyr (Acolyte) on Oct 14, 2012 at 20:21 UTC
Hi, Thanks for your answer. I dont really get the final part. Why do you substract the ranges from maxend? Now it just prints: not overlapping: chr7: 41396, 22342744 - (41396, 41743), (41610, 41757), (884716, 884900), (22342598, 22342744). Not a number. Could you explain this part? Thanks!	[reply]
Re^3: Find overlap by linseyr (Acolyte) on Oct 14, 2012 at 20:25 UTC
Oh im sorry i get it. you just print the ranges out the files.!	[reply]

Back to Seekers of Perl Wisdom