http://www.perlmonks.org?node_id=279587

aging acolyte has asked for the wisdom of the Perl Monks concerning the following question:

Dear All,

More of a conceptual problem than a Perl problem per se.

I am parsing the output of a program (non-perl and not editable) that compares two DNA sequences. The results are given as sets of 2-D co-ordinates.

e.g.

[ ['NM_144963','2713','4091'], ['NM_144963','1949','2712'] ];
where field 1 is the name
field2 is the start coordinate
and field3 is the end coordinate

A simple sort allows me to sort the above and calculate the coverage

my @sorted = sort {($a->[1] <=>$b->[1]) || ($a->[2] <=> $b->[2])}@ +array; my $coverage = 0; for (@sorted){ $coverage += $_->[2] - $_->[1] +1; } printf "cover = %.1f%% \n", $coverage/$length *100;
All well and good for the above case. But parts of my data are overlapping e.g.
[ ['NM_176827','618','710'], ['NM_176827','621','710'], ['NM_176827','622','692'], ['NM_176827','629','710'] ]
Here my code breaks down and I cannot for the life of me think of a data structure that will allow adequately deal with both cases (i.e a collection of distinct hits, a collection of overlapping hits) or worse a combination of the two.

When they overlap I need to pick the longest hit. (i.e. the first array above).

Am I just having a brain metldown today - Can someone point me in the direction of an obvious solution. Thanks a lot

A.A.

Replies are listed 'Best First'.
Re: Comparing 2-D co-ordinates
by Zaxo (Archbishop) on Jul 31, 2003 at 14:30 UTC

    I think your data is better regarded as intervals in one dimension. Sort only on the start index. If you need to check for overlap, it can be done later, on the sorted data.

    After Compline,
    Zaxo

Re: Comparing 2-D co-ordinates
by fglock (Vicar) on Jul 31, 2003 at 14:47 UTC
    use Set::Infinite; my @array = ( ['NM_144963','2713','4091'], ['NM_144963','1949','2712'] ); my $set = Set::Infinite->new->integer; $set = $set->union( $_->[1], $_->[2] ) for @array; print "Set: $set, Size: ",$set->size," Length: ",$set->span->size,"\n" +; # Set: [1949..4091], Size: 2142 Length: 2142 @array = ( ['NM_176827','618','710'], ['NM_176827','621','710'], ['NM_176827','622','692'], ['NM_176827','629','710'] ); $set = Set::Infinite->new->integer; $set = $set->union( $_->[1], $_->[2] ) for @array; print "Set: $set, Size: ",$set->size," Length: ",$set->span->size,"\n" +; # Set: [618..710], Size: 92 Length: 92

    I'm not sure if this is what you want. Please let me know!

Re: Comparing 2-D co-ordinates
by BrowserUk (Patriarch) on Jul 31, 2003 at 14:52 UTC

    You can avoid the need to sort or even do any comparisons. Just loop through the data and build a string using one character to represent your ranges and another the 'freespace'. Then just count the non-freespace chars and your done.

    #! perl -slw use strict; my @data = map{ [ split "','", substr( $_, 1, -2 ) ] } <DATA>; my $mask = ' ' x 10_000; # Make a blank string that's 'big enough' my $max = 0; # we'll trim it back later for( @data ) { #calc the length of the range my $len = $_->[2] - $_->[1] +1; # overwrite the range with 'xx's substr( $mask, $_->[1], $len ) = 'x' x $len; # Remember the highest offset $max = $_->[2] if $_->[2] > $max; } $mask = substr( $mask, 0, $max ); # Trim to length my $coverage = $mask =~ tr[x][x]; # Count the 'x's printf "cover = %.1f%% \n", $coverage / length($mask) *100; __DATA__ 'NM_176827','618','710' 'NM_176827','621','710' 'NM_176827','622','692' 'NM_176827','629','710'

    Output

    P:\test>279587 cover = 13.0%

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

      Im not sure I've understood your full specs. Whenever two matches overlap, do you want to discard the shorter? Even in this case?:
      <----> <-----> <---->
      Here the two shorter matches together give more coverage than the longest. Another structure to represent this data is a list in which each range occurs twice, a lexically sorted list of pairs, (start,end) and (end,start). But before choosing your data structure, you have to know what question you wan to answer. Chris Morris
        Chris,

        Thanks for helping clarify my problem. I want to find the set of discrete hits that gives me the best coverage (if that make sense?).

        In your example above (ah the beauty of acsii graphics) I would like to pick the two shorter examples as combined they give me a higher overall coverage.

        I apologise for any confusion that I am causing - I am having a bad day - too hot and sticky in here!

        A.A.

      Thanks for the suggestions,

      But both Browser's and fglock suggestions are greedy - ie they grab the maximum range. I want the largest discrete answer. To munge my data a little from above. If the data were:

      'NM_176827','620','710' 'NM_176827','621','710' 'NM_176827','618','692' 'NM_176827','629','710'
      The larget individual result is still number 1 but the range is still 618-710 (both the above return the range of coverage rather than "the" best result.

      Thanks again

      A.A.

        Sorry. I'm still misunderstanding you. Given this sample dataset, what 'best result' do you want?

        The index of the largest range?

        The %coverage by that largest range?

        Something else?

        Ie. What output do you want given that input, and how is it derived?


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

Re: Comparing 2-D co-ordinates
by Helter (Chaplain) on Jul 31, 2003 at 14:33 UTC
    In your first line of code, you are defining a subroutine to call to sort your data. You can do anything you want in a subroutine.

    So add the code that checks for overlaps (a bunch of compares for intersection) and if they overlap then determine which is longer and return that. As the sort goes through they should be sorted the way you want.

    It might be cleaner (from a code standpoint) to declare the subroutine separate, and then pass a refrence to sort.

    Hope this helps!
Re: Comparing 2-D co-ordinates
by jmanning2k (Pilgrim) on Jul 31, 2003 at 17:36 UTC
    After reading your comments through this thread that provide more details, I think I understand your problem.

    You're struggling with a particular perl problem, but I think it would be most helpful to take a step back and look at the problem as a whole.

    This sounds very Buddhist Monk-like, but what you want is 'A Golden Path'. A single path that covers the entire range, but uses the fewest overlapping pieces to get there.

    There are several well-understood algorithms for doing that. You might try the Graph module, which provides access to several different ways of analyzing this problem (See Graph::Base). Understanding this involves some pretty heavy mathematics, but you essentially create a line (1D, not 2D) along which you place your points as vector pairs (the start point and the end point) Then apply one of the available algorithms to find the shortest connecting path. The same sort of algorithms are used in many other applications. (such as networking to find the fastest route connecting two nodes.)

    Do a search on 'golden path algorithm' or some of the algorithms directly, such as 'dijkstra shortest path perl'. (That last one even has a few perl implementations, but the first one says 'use Graph instead'.)

    Developing your own algorithm is certainly possible, but this is a much more complex problem than it first appears to be. Might as well use a well thought out and tested algorithm - you just have to understand that the same method of finding the shortest distance between two network nodes also applies to finding the fewest non-overlapping clones on a tiling path.

    Sorry my reply is less perl and more theory, but it may be a better solution to your problem.

    Update:I just found Bio::Coordinate::Graph too.
Re: Comparing 2-D co-ordinates
by dragonchild (Archbishop) on Jul 31, 2003 at 15:33 UTC
    You're not working with 2-D coordinates ... you're working with 1-D coordinates. You're working on a line, not a plane. So, if you want to deal with finding the longest, calculate the length found.

    Now, for the overlap ... it sounds like you need to do something like the following:

    1. Calculate length for each coordinate pair.
    2. Loop through the pairs, keeping each pair that provides unique coverage. If the pair's start or end coordinate is within the coverage provided by a previously-seen pair, that spot is kept by the pair with the longest coverage.
    3. Once you have determined that you're replacing pair A with pair B, you have to make sure that you're catching all pairs that are eliminated by pair B. For example, if your current list of acceptable pairs is: (1,3), (4, 6) and you come across (2,5) ... both (1,3) and (4,6) would have to be replaced with (2,5). (Assuming, of course, that I've understood your spec.)
    If you want, I can help provide some code for you. /msg me if you'd like.

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

      I could be wrong but this doesn't quite solve the problem. See Re**4: Comparing 2-D co-ordinates. As a non-CS person I could be off-base but I'd call this a 1-D knapsack.

      --
      I'm not belgian but I play one on TV.

        Thanks,

        Not just for solving the problem but for pointing out that it is actually one of the class of NP-complete problems and so not making me feel too bad about not being able to come up with a simple solution.

        Now I think I just to to google for a solution.

        A.A.

      Close but no coconut

      Not that I am criticising - the problem is mine for not explaing the specifications clearly enough.

      A final stab at defining it: Given a set of 1-D coordinates how can I calculate the maximum coverage using discrete non-overlapping sets.

      e.g. for the set (1,3),(4,6),(2-5) the mximum coverage I can find is (1,3) + (4,6).

      Is that good enough or should I just stop digging?

      :-)

        Given that definition, belg4mit is absolutely right - this is a 1-D knapsack. Look in any upper-level algorithms book or just google "knapsack algorithm". My algorithm would have given you the smallest number of largest individual segements without taking into account the total coverage. (It can be easily extended to that by figuring out the all the overlaps for a given new segment, then comparing the total coverage between the new segment and the old segment(s). However, I would go with the algorithm that PhD's have optimized. If you cannot figure out how to write it, I'd be glad to provide an implementation of my naive version.)

        ------
        We are the carpenters and bricklayers of the Information Age.

        The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

        Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Comparing 2-D co-ordinates
by dragonchild (Archbishop) on Jul 31, 2003 at 20:19 UTC
    Maybe I'm recreating the wheel here, but wouldn't the following work for you?
    #!/usr/bin/perl my @data = ( [ 2, 8 ], [ 1, 3 ], [ 4, 6 ], [ 7, 10 ], [ 8, 20 ], ); foreach my $pair (@data) { @{$pair}[1,0] = @{$pair}[1,0] if $pair->[0] > $pair->[1]; $pair->[2] = $pair->[1] - $pair->[0]; } @data = sort { $a->[2] cmp $b->[2] || $a->[0] cmp $b->[0] || $a->[1] c +mp $b->[1] } @data; my @good; PAIR: foreach my $pair (@data) { my @check; foreach my $i (0 .. $#good) { my $good = $good[$i]; if ( ($good->[0] <= $pair->[0] && $pair->[0] < $goo +d->[1]) || ($good->[0] < $pair->[1] && $pair->[1] <= $goo +d->[1]) || ($pair->[0] <= $good->[0] && $good->[0] < $pai +r->[1]) || ($pair->[0] < $good->[1] && $good->[1] <= $pai +r->[1]) ) { push @check, [ $good, $i, ]; } } unless (@check) { push @good, $pair; next PAIR; } my $max = 0; $max += $_->[2] for map { $_->[0] } @check; next PAIR if $max > $pair->[2]; splice @good, $_, 1 for sort { $b <=> $a } map { $_->[1] } @ch +eck; push @good, $pair; } @good = sort { $a->[0] cmp $b->[0] || $a->[1] cmp $b->[1] } @good; print "$_->[0] ... $_->[1] ($_->[2])\n" for @good; my $cover = 0; $cover += $_->[2] for @good; print "Total coverage is $cover\n";

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.