Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Finding Overlapping Regions on Genome

by oxydeepu (Novice)
on Jul 11, 2013 at 09:45 UTC ( #1043671=perlquestion: print w/ replies, xml ) Need Help??
oxydeepu has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I have a small problem, I am not able to find a logic to do it.

Question is: I have a file of exons like this where col1 is chromosome name col2 is orientation and rest are start and stop.

Contig0 + 127874 130761
Contig0 + 129936 129984
Contig0 + 130572 133438
Contig0 + 130573 130607
Contig0 + 130630 130761
Contig0 + 130732 130767
Contig0 + 130784 130818
Contig0 + 130832 130866
Contig0 + 130832 130867
Contig0 + 130893 130928
Contig0 + 130970 131004
Contig0 + 130982 131017

As this coordinates have overlapping

Contig0 + 127874 130761
Contig0 + 129936 129984
Contig0 + 130572 133438
Contig0 + 130573 130607
Contig0 + 130630 130761
Contig0 + 130732 130767

It will be

Contig0 + 127874 130767

then

Contig0 + 130784 130818

doesn't have any over lap. Then,

Contig0 + 130832 130866
Contig0 + 130832 130867

will be

Contig0 + 130832 130867

After that no overlap for

Contig0 + 130893 130928

and for

Contig0 + 130970 131004
Contig0 + 130982 131017

it will be

Contig0 + 130970 131017

So at last the result for the example block will be

Contig0 + 127874 130767
Contig0 + 130893 130928
Contig0 + 130970 131017

I hope the question is clear.
Can anyone please help me with this.
Thank you in advance,
Deepak

Comment on Finding Overlapping Regions on Genome
Re: Finding Overlapping Regions on Genome
by daxim (Chaplain) on Jul 11, 2013 at 10:00 UTC
Re: Finding Overlapping Regions on Genome
by choroba (Abbot) on Jul 11, 2013 at 10:06 UTC
    I do not get it. Why this line
    Contig0 + 130572 133438
    does not make it one huge range?
    Contig0 + 127874 133438
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Finding Overlapping Regions on Genome
by mtmcc (Hermit) on Jul 11, 2013 at 10:16 UTC
    Hi oxydeepu,

    Maybe it's just me, but I'm not fully clear on the question.

    Why is the first result you're looking for: Contig0 + 127874 130767, when the first block of numbers contains the contig: Contig0 + 130572 133438? Don't they overlap?

    Michael
Re: Finding Overlapping Regions on Genome
by kcott (Abbot) on Jul 11, 2013 at 10:23 UTC
Re: Finding Overlapping Regions on Genome
by oxydeepu (Novice) on Jul 11, 2013 at 11:13 UTC

    Hi all,

    I am so sorry. I wont repeat the mistake. I do understand the policy.

    The question is

    how to get the overlapping coordinates pairs if there any? If yes take the smallest start position and largest stop position from overlapping coordinates and if no overlap for a pair, report as it is?

    The example I gave was a small part of a huge file. I was manually doing it, that is why I missed some overlapping coordinates.

    I did write a code for it.

    here is the code which I wrote,

    open $fh,"genome_pred_exons.bed"; use List::Util qw( min max ); while(<$fh>) { chomp; @lh = split /\t/,$_; $ph{$lh[0]."\t".$lh[1]."\t".$lh[2]} .= $lh[3]."\t".$lh[4]."NNN +NNN"; } close($fh); foreach $k(sort keys %ph) { @c = sort { $a <=> $b } split ("NNNNNN",$ph{$k}); @g = (); foreach $m(@c) { @r = split /\t/,$m; if (@g == 0) { push @g, [$r[0]+0, $r[1]+0]; next; } else { for ($i = 0;$i <= $#g;$i++) { if(($r[0] + 0) <= ($g[$i][1] + 1)) { $g[$i][0] = min($g[$i][0]+0, $r[0]+0); $g[$i][1] = max($g[$i][1]+0, $r[1]+0); last; } elsif ( $i == $#g ) { push @g, [$r[0]+0, $r[1]+0]; last; } } } } } print Dumper \@g;

    This was an adapted script from internet. I changed it, but it did not give me any output.

    I hope this is a better way to ask than the last one.

    thank you, Deepak

Re: Finding Overlapping Regions on Genome
by mtmcc (Hermit) on Jul 11, 2013 at 11:47 UTC
    If I understand what you want correctly, this (not very elegant) script should work (provided your values are in ascending order of contig start site (column 3)):

    #!/usr/bin/perl use strict; use warnings; my $dataFile = $ARGV[0]; my $get = 1; my @array = (); my $col1 = 0; my $col2 = 0; my $low = 0; my $high = 0; open (FILE, "<", $dataFile); while (<FILE>) { next unless $_ =~ /[0-9]/; @array = split(" ", $_); if ($get == 1) { if (($array[2] > $high) && ($high > 0)) { print STDOUT "$col1\t$col2\t$low\t$high\n"; $low = $array[2]; $high = $array[3]; } $col1 = $array[0]; $col2 = $array[1]; $low = $array[2] if (($array[2] < $low) || ($low == 0) +); $high = $array[3] if (($array[3] > $high) || ($high == + 0));; $get = 0; } if ($array[2] <= $high) { $high = $array[3] if $array[3] > $high; } if (($array[2] > $high) || (eof(FILE))) { print STDOUT "$col1\t$col2\t$low\t$high\n"; $low = $array[2]; $high = $array[3]; $get = 1; } }

    If the chromosome changes during your large file, it would need to be modified to account for that.

    There's probably a much prettier way of doing it though...

    Michael

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1043671]
Approved by philipbailey
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (5)
As of 2014-12-20 11:50 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (95 votes), past polls