http://www.perlmonks.org?node_id=998890

linseyr has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I want to find the overlapping regions from 4 files which look like this:
file: 148N chr1 10 50 file: 162N chr1 9 40 file: 174N chr1 12 60 file: 175N chr1 30 45
If the fist column (chr) is the same in ALL files, i want to compare column 3 and 4 (start and end). If in ALL files the regions are overlapping, I want to return the smallest start position and the largest end position to get the complete region in which there are overlapping reads. So for this input the output would be:
chr1 9 60
Because there is a region that is overlapping in all files. I want to do this, ONLY if ALL 4 files are overlapping. I wrote a script now to check for all possible combination for overlapping reads, but because of all the many possible combinations I think it wont be finished in this week... This is my script now:
open(N148,'<',"148Nsorted.bed") or die $!; my @Sample148N = <N148>; close(N148); open(N162,'<',"162Nsorted.bed") or die $!; my @Sample162N = <N162>; close(N162); open(N174,'<',"174Nsorted.bed") or die $!; my @Sample174N = <N174>; close(N174); open(N175,'<',"175Nsorted.bed") or die $!; my @Sample175N = <N175>; close(N175); my @results; my $start; my $end; my $overlap = "no-overlap"; my $option1 = ($end148-$start162 >= 1 && $end162 > $end148) && ($end17 +4 - $start148 > 1 && $start174 < $start148) && ($start175 > $start148 + && $end148 > $end175); my $option2 = ($end148-$start162 >= 1 && $end162 > $end148) && ($end17 +5 - $start148 > 1 && $start175 < $start148) && ($start174 > $start148 + && $end148 > $end174); my $option3 = ($end148-$start174 >= 1 && $end174 > $end148) && ($end17 +5 - $start148 > 1 && $start175 < $start148) && ($start162 > $start148 + && $end148 > $end162); my $option4 = ($end148-$start175 >= 1 && $end175 > $end148) && ($end17 +4 - $start148 > 1 && $start174 < $start148) && ($start162 > $start148 + && $end148 > $end162); my $option5 = ($end148-$start174 >= 1 && $end174 > $end148) && ($end16 +2 - $start148 > 1 && $start162 < $start148) && ($start175 > $start148 + && $end148 > $end175); my $option6 = ($end148-$start175 >= 1 && $end175 > $end148) && ($end16 +2 - $start148 > 1 && $start162 < $start148) && ($start174 > $start148 + && $end148 > $end174); my $option7 = ($end162-$start148 >= 1 && $end148 > $end162) && ($end17 +4 - $start162 > 1 && $start174 < $start162) && ($start175 > $start162 + && $end162 > $end175); my $option8 = ($end162-$start148 >= 1 && $end148 > $end162) && ($end17 +5 - $start162 > 1 && $start175 < $start162) && ($start174 > $start162 + && $end162 > $end174); my $option9 = ($end162-$start174 >= 1 && $end174 > $end162) && ($end17 +5 - $start162 > 1 && $start175 < $start162) && ($start148 > $start162 + && $end162 > $end148); my $option10 = ($end162-$start175 >= 1 && $end175 > $end162) && ($end1 +74 - $start162 > 1 && $start174 < $start162) && ($start148 > $start16 +2 && $end162 > $end148); my $option11 = ($end162-$start174 >= 1 && $end174 > $end162) && ($end1 +48 - $start162 > 1 && $start148 < $start162) && ($start175 > $start16 +2 && $end162 > $end175); my $option12 = ($end162-$start175 >= 1 && $end175 > $end162) && ($end1 +48 - $start162 > 1 && $start148 < $start162) && ($start174 > $start16 +2 && $end162 > $end174); my $option13 = ($end174-$start162 >= 1 && $end162 > $end174) && ($end1 +48 - $start174 > 1 && $start148 < $start174) && ($start175 > $start17 +4 && $end174 > $end175); my $option14 = ($end174-$start162 >= 1 && $end162 > $end174) && ($end1 +75 - $start174 > 1 && $start175 < $start174) && ($start148 > $start17 +4 && $end174 > $end148); my $option15 = ($end174-$start148 >= 1 && $end148 > $end174) && ($end1 +75 - $start174 > 1 && $start175 < $start174) && ($start162 > $start17 +4 && $end174 > $end162); my $option16 = ($end174-$start175 >= 1 && $end175 > $end174) && ($end1 +48 - $start174 > 1 && $start148 < $start174) && ($start162 > $start17 +4 && $end174 > $end162); my $option17 = ($end174-$start148 >= 1 && $end148 > $end174) && ($end1 +62 - $start174 > 1 && $start162 < $start174) && ($start175 > $start17 +4 && $end174 > $end175); my $option18 = ($end174-$start175 >= 1 && $end175 > $end174) && ($end1 +62 - $start174 > 1 && $start162 < $start174) && ($start148 > $start17 +4 && $end174 > $end148); my $option19 = ($end175-$start162 >= 1 && $end162 > $end175) && ($end1 +74 - $start175 > 1 && $start174 < $start175) && ($start148 > $start17 +5 && $end175 > $end148); my $option20 = ($end175-$start162 >= 1 && $end162 > $end175) && ($end1 +48 - $start175 > 1 && $start148 < $start175) && ($start174 > $start17 +5 && $end175 > $end174); my $option21 = ($end175-$start174 >= 1 && $end174 > $end175) && ($end1 +48 - $start175 > 1 && $start148 < $start175) && ($start162 > $start17 +5 && $end175 > $end162); my $option22 = ($end175-$start148 >= 1 && $end148 > $end175) && ($end1 +74 - $start175 > 1 && $start174 < $start175) && ($start162 > $start17 +5 && $end175 > $end162); my $option23 = ($end175-$start174 >= 1 && $end174 > $end175) && ($end1 +62 - $start175 > 1 && $start162 < $start175) && ($start148 > $start17 +5 && $end175 > $end148); my $option24 = ($end175-$start148 >= 1 && $end148 > $end175) && ($end1 +62 - $start175 > 1 && $start162 < $start175) && ($start174 > $start17 +5 && $end175 > $end174); for my $line(@Sample148N){ my($chr148,$start148,$end148) = split("\t",$line); for my $line2(@Sample162N){ my($chr162,$start162,$end162) = split("\t",$line2); for my $line3(@Sample174N){ my($chr174,$start174,$end174) = split("\t",$line3); for my $line4(@Sample175N){ my($chr175,$start175,$end175) = split("\t",$line2); if(($chr148 eq $chr162 && $chr148 eq $chr174 && $chr14 +8 eq $chr175) && ($option1 || $option2 || $option3 || $option4 || $op +tion5 || $option6 || $option7 || $option8 || $option9 || $option10 || + $option11 || $option12 || $option13 || $option14 || $option15 || $op +tion16 || $option17 || $option18 || $option19 || $option20 || $option +21 || $option22 || $option23 || $option24){ $overlap = "overlap"; $start = $start148 if ($start148 < $start162 && $s +tart148 <$start174 && $start148 < $start175); $end = $end148 if ($end148 > $end162 && $end148 > +$end174 && $end148 > $end175); $start = $start162 if ($start162 < $start148 && $s +tart162 <$start174 && $start162 < $start175); $end = $end162 if ($end162 > $end148 && $end162 > +$end174 && $end162 > $end175); $start = $start174 if ($start174 < $start162 && $s +tart174 <$start148 && $start174 < $start175); $end = $end174 if ($end174 > $end162 && $end174 > +$end148 && $end174 > $end175); $start = $start175 if ($start175 < $start148 && $s +tart175 <$start174 && $start175 < $start162); $end = $end175 if ($end175 > $end148 && $end175 > +$end174 && $end175 > $end162); } } print "$chr148 $start $end"; }
Could somebody help me find an easier way to do this?\ Thanks a lot!

Replies are listed 'Best First'.
Re: Find overlap
by frozenwithjoy (Priest) on Oct 14, 2012 at 00:19 UTC
    Hi, a few questions to start off:
    You say there are 4 columns/line, but I only see 3, or do you mean 'chr' is in column 1 and the chromosome # is in column 2?
    Related to this, are there multiple chromosomes represented/file or are all regions from chr1?
    Approximately how many ranges are in each file?
    Finally, are the columns tab-delimited or?
      Hi, No there are three columns per line: chr, start and end. There are 4 different files which contain these three columns. In total there are 24 chromosomes, and every files contains a different number of lines per chromosome. Every file has aprox 240,000 lines. so about 10,000 per chromosome.
        hi,

        i'm not sure if i get the problem right but this script will find the min of all values in column 2 of all given files and the max of all values in column 3.

        you just call it with the file names as arguments.

        #!/usr/bin/perl -w use strict; my $min = undef; my $max = undef; foreach (@ARGV) { open(F, "< $_"); while(my $line = <F>) { chop($line); $line =~ s/[\ \t]+/:/g; my @p = split(/:/,$line); if(!defined($min)) { $min = $p[1]; } if(!defined($max)) { $max = $p[2]; } if($p[1] < $min) { $min = $p[1]; } if($p[2] > $max) { $max = $p[2]; } } close(F); } print "$min $max\n";
        but if you need the first column as well, and if the columns are tab separated, this script would give you the minima of chr1..chrN of column2 of all given files and the maxima of chr1..chrN of column3 of all given files. do you need the file names as well? and i hope i understood the problem right
        #!/usr/bin/perl -w use strict; my %minmax; foreach (@ARGV) { open(F, "< $_"); while(my $line = <F>) { chop($line); my @p = split(/\t/,$line); if(!exists( $minmax{ $p[0] }{ min } )) { $minmax{ $p[0] }{ min } = $p[1]; } if($p[1] < $minmax{ $p[0] }{ min }) { $minmax{ $p[0] }{ min } = $p[1]; } if(!exists( $minmax{ $p[0] } { max } )) { $minmax{ $p[0] }{ max } = $p[2]; } if($p[2] > $minmax{ $p[0] }{ max }) { $minmax{ $p[0] }{ max } = $p[2]; } } close(F); } foreach (keys %minmax) { print "$_: min: $minmax{$_}{min} max: $minmax{$_}{max}\n"; }
        for the given input:

        file1:

        chr1 4 60 chr2 2 40 chr3 4 90 chr1 5 40
        file2:
        chr2 1 30 chr1 6 20 chr4 9 100
        file3:
        chr1 2 20 chr2 2 90 chr1 6 20 chr4 4 30
        file4:
        chr2 4 90 chr3 3 90 chr2 4 90 chr4 3 90 chr2 4 30
        it would output:
        chr1: min: 2 max: 60 chr2: min: 1 max: 90 chr3: min: 3 max: 90 chr4: min: 3 max: 100
Re: Find overlap
by jwkrahn (Abbot) on Oct 14, 2012 at 10:01 UTC

    Here is one way to do it:

    #!/usr/bin/perl use warnings; use strict; @ARGV = ( '148Nsorted.bed', '162Nsorted.bed', '174Nsorted.bed', '175Ns +orted.bed' ); my %data; while ( <> ) { /^\s*(\S+)\s+(\d+)\s+(\d+)\s*$/ or next; $data{ $1 } |= '0' x ( $2 - 1 ) . '1' x ( $3 - ( $2 - 1 ) ); } keys( %data ) == 1 or die "Error: too many keys.\n"; my ( $name, $string ) = each %data; $string =~ /10+1/ and die "Error: no overlap.\n"; $string =~ /^0*1/ and my $start = $+[ 0 ]; $string =~ /.*1/ and my $end = $+[ 0 ]; print "$name\t$start\t$end\n";
      Hi, Thanks but this doesn't work for these files. I get an error "Too many keys"
      Sorry but im pretty new with perl so I dont really know what this code does. Could you give me some explanation? And why do I get the error: Too many keys when I try to run it? Thanks.

        The code uses the bit-wise OR operator (|) to turn all bytes in the string in the range to the character '1'.

        And why do I get the error: Too many keys when I try to run it?

        The keys of %data represent the first column of the data files ('chr1') so if you get that error message it means that there was something other than 'chr1' in one of the files.

        After thinking about the problem, and rereading it, it seems that ALL files must overlap so this may work better:

        #!/usr/bin/perl use warnings; use strict; @ARGV = ( '148Nsorted.bed', '162Nsorted.bed', '174Nsorted.bed', '175Ns +orted.bed' ); my ( $bit_mask, %data ) = 1; while ( <> ) { /^\s*(\S+)\s+(\d+)\s+(\d+)\s*$/ or next; $data |= chr( 0 ) x ( $2 - 1 ) . chr( $bit_mask ) x ( $3 - ( $2 - +1 ) ); $bit_mask <<= 1; } keys( %data ) == 1 or die "Error: too many keys: @{[ keys %data ]}\n"; my ( $name, $string ) = each %data; $string =~ /\x0f/ or die "Error: no overlap on all files."; $string =~ /[^\0]/ and my ( $start, $end ) = ( $+[ 0 ], length $string + ); print "$name\t$start\t$end\n";
Re: Find overlap
by remiah (Hermit) on Oct 14, 2012 at 02:05 UTC

    Hello lunseyr.

    I guess maybe something like this ...Input <DATA> may be diffrent from your files. This example prints

    overlaps:9,200
    no overlaps for chr2
    
    I just thought packages may be good for adding constraints to your items.

    Here I used Moose, but I am feeling something not good. I hope someone points me for better design. regards.

Re: Find overlap
by ig (Vicar) on Oct 14, 2012 at 19:17 UTC

    Update: sorry, but this doesn't work. It misses many overlapping sets.

    update: the following might do what you want. It will take more memory and time than the above but it will find all overalpping sets.

    use strict; use warnings; use List::Util qw(min max); my $ranges148 = scan('148Nsorted.bed'); my $ranges162 = scan('162Nsorted.bed'); my $ranges174 = scan('174Nsorted.bed'); my $ranges175 = scan('175Nsorted.bed'); foreach my $chr (keys %$ranges148) { foreach my $range148 (@{$ranges148->{$chr}}) { foreach my $range162 (@{$ranges162->{$chr}}) { foreach my $range174 (@{$ranges174->{$chr}}) { foreach my $range175 (@{$ranges175->{$chr}}) { my $minstart = min( $range148->[0], $range162->[0], $range174->[0], $range175->[0]); my $maxstart = max( $range148->[0], $range162->[0], $range174->[0], $range175->[0]); my $minend = min( $range148->[1], $range162->[1], $range174->[1], $range175->[1]); my $maxend = max( $range148->[1], $range162->[1], $range174->[1], $range175->[1]); if($maxstart <= $minend) { print "overlapping $chr: $minstart, $maxend - +($range148->[0], $range148->[1]), ($range162->[0], $range162->[1]), ( +$range174->[0], $range174->[1]), ($range175->[0], $range175->[1])\n"; } else { print "not overlapping: $chr: $minstart, $maxe +nd - ($range148->[0], $range148->[1]), ($range162->[0], $range162->[1 +]), ($range174->[0], $range174->[1]), ($range175->[0], $range175->[1] +)\n"; } } } } } } exit(0); sub scan { my ($file) = @_; open(my $fh, '<', $file) or die "$file: $!"; my $ranges; foreach my $line (<$fh>) { chomp($line); my ($chr, $start, $end) = split(/\t/, $line); push(@{$ranges->{$chr}}, [ 0+$start, 0+$end ]); } return($ranges); }
      Hi, Thanks for your answer. I dont really get the final part. Why do you substract the ranges from maxend? Now it just prints: not overlapping: chr7: 41396, 22342744 - (41396, 41743), (41610, 41757), (884716, 884900), (22342598, 22342744). Not a number. Could you explain this part? Thanks!
        Oh im sorry i get it. you just print the ranges out the files.!