PerlMonks  

Millions of line segment intersection calcs: Looking for speed tips

by drewhead (Beadle)
on Aug 03, 2005 at 14:18 UTC ( #480481=perlquestion )
drewhead has asked for the wisdom of the Perl Monks concerning the following question:

I have a set of X,Y coordinates on a 2D map. I'm trying to produce a set of numbers (average, mean, stddev) of all edge lengths between neighboring coordinates. An edge is defined as a line segment between two points. A neighbor is defined as an edge such that no shorter edge defining a neighbor relationship, from any possible combination of coordinates, intersects it. [I think that's clear ;P]

The issue here is speed. This algorithm is O(N^3), and it doesn't take many coordinates to reach into the millions of calculations needed. For my example here, 301 coordinates, I'm finding I need 28882029 line segment intersection calculations. blech!

First I build all the unique edges. Then I loop through the edges sorted by length. Inside that I loop through all previously defined neighbor edges to determine if I have any intersection. If I don't intersect, I push that edge onto the list of neighbors.

Here's a code example and a link to sample data:

example YAML dumped data: http://keep.drewhead.org/dump_301.dat

#!/usr/bin/perl
######################################################################
# CPAN Modules
######################################################################
use YAML qw(Load);
use Data::Dumper;
use Carp;
use Benchmark;
use strict;

######################################################################
# A few globals
######################################################################
my $points;        # array ref holding the raw data [[X,Y], [X,Y], ...]
my $edges;         # array ref keeping references to all edges [edge, X1, Y1, X2, Y2] ...
my $neighborEdges; # hash ref holding sorted neighbor edges {'1' => edge1, '2' => edge2, ...}
my $precision = 7;
my $delta = 10 ** (-$precision);

######################################################################
#
# MAIN
#
######################################################################
# Read in the X,Y coordinates from supplied data
{
    local $/;
    open(my $fh, "dump_301.dat") or die "Unable to open file: $!\n";
    $points = Load( <$fh> );
    croak($@) if $@;
    close($fh);
}

my $start_pairs = Benchmark->new;
$edges = find_unique_edges($points);
print "Unique edges ", scalar(@{$edges}), "\n";
print "Pairs parsed in :", timestr(timediff(Benchmark->new, $start_pairs)), "\n";

my $start_intersects = Benchmark->new;
$neighborEdges = find_unintersected_edges($edges);
print "Unintersected edges ", scalar(keys %{$neighborEdges}), "\n";
print "Intersects parsed in :", timestr(timediff(Benchmark->new, $start_intersects)), "\n";

######################################################################
######################################################################
sub find_unique_edges {
    my $MAP = shift;
    my @edges;   # returnable data structure
    my $start_distance = Benchmark->new;
    my $found = {};

    # p1 is the starting (anchor) point of the line segment
    foreach my $p1 (@{$points}) {
        # p2 is the end point of the line segment
        foreach my $p2 (@{$points}) {
            # We don't need to calculate if anchor and end are the same,
            # or if we have already seen these two pairs [reversed] before
            unless ( ($p1 == $p2) || $found->{$p2}->{$p1} ) {
                # Compute the edge length
                my $edge = sqrt( ($p1->[0] - $p2->[0])**2
                               + ($p1->[1] - $p2->[1])**2 );
                # Push the whole thing on a stack
                push (@edges, [ $edge, $p1->[0], $p1->[1], $p2->[0], $p2->[1] ]);
                # Keep track of the edges we've already computed (no need to do so twice)
                $found->{$p2}->{$p1} = 1;
            }
        }
    }
    return(\@edges);
}

######################################################################
######################################################################
sub find_unintersected_edges {
    my $PAIRS = shift;
    my $neighborEdges;   # returnable data structure
    my $neighboorCNT = 1;

    foreach my $aref ( sort { $a->[0] <=> $b->[0] } @{$PAIRS} ) {
        my $neighbor = 1;
        my $line_1_ref = [];
        push (@{$line_1_ref}, [$aref->[1], $aref->[2]]);   # line 1 point A X,Y
        push (@{$line_1_ref}, [$aref->[3], $aref->[4]]);   # line 1 point B X,Y

        for (my $n = 1; $n <= $neighboorCNT; $n++) {
            if ($neighborEdges->{$n}) {
                my $points_ref;
                push (@{$points_ref}, @{$line_1_ref});   # line 1 points
                push (@{$points_ref}, [$neighborEdges->{$n}->[1], $neighborEdges->{$n}->[2]]);   # line 2 point A X,Y
                push (@{$points_ref}, [$neighborEdges->{$n}->[3], $neighborEdges->{$n}->[4]]);   # line 2 point B X,Y

                # If an intersection is found, set false and quit checking
                if (SegmentIntersection($points_ref)) {
                    $neighbor = 0;
                    last;
                }
            }
        }
        if ($neighbor) {
            $neighborEdges->{$neighboorCNT} = $aref;
            $neighboorCNT++;
        }
    }
    return($neighborEdges);
}

################################################################################
#
# Shamelessly stolen from Math::Geometry::Planar
sub SegmentIntersection {
    my $pointsref = $_[0];
    my @points = @$pointsref;
    my @p1 = @{$points[0]};   # p1,p2 = segment 1
    my @p2 = @{$points[1]};
    my @p3 = @{$points[2]};   # p3,p4 = segment 2
    my @p4 = @{$points[3]};

    my $n1 = Determinant(($p3[0]-$p1[0]), ($p3[0]-$p4[0]), ($p3[1]-$p1[1]), ($p3[1]-$p4[1]));
    my $n2 = Determinant(($p2[0]-$p1[0]), ($p3[0]-$p1[0]), ($p2[1]-$p1[1]), ($p3[1]-$p1[1]));
    my $d  = Determinant(($p2[0]-$p1[0]), ($p3[0]-$p4[0]), ($p2[1]-$p1[1]), ($p3[1]-$p4[1]));

    if (abs($d) < $delta) {
        return 0;   # parallel
    }
    if (!( ($n1/$d < 1) && ($n2/$d < 1) && ($n1/$d > 0) && ($n2/$d > 0) )) {
        return 0;
    }
    return(1);
}

################################################################################
#
# The determinant for the matrix  | x1 y1 |
#                                 | x2 y2 |
#
# args : x1,y1,x2,y2
#
sub Determinant {
    my ($x1, $y1, $x2, $y2) = @_;
    return ($x1*$y2 - $x2*$y1);
}

The correct answer is Unintersected edges 1778.

In this code I'm only trying to find the unique edges. Once I have those, the statistics I need to calculate are trivial.
Also, I took the code from Math::Geometry::Planar and tweaked the SegmentIntersection routine to return a boolean. As written, it returned the intersection point, which was more than I actually needed.
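For completeness, here is a hedged sketch (not the OP's code; `edge_stats` is a made-up helper name) of what that trivial statistics pass over the surviving edge lengths might look like:

```perl
use strict;
use warnings;

# Sketch only: mean and (population) standard deviation of a list of
# neighbor edge lengths.
sub edge_stats {
    my @lengths = @_;
    my $n = @lengths or return;
    my $mean = 0;
    $mean += $_ for @lengths;
    $mean /= $n;
    my $var = 0;
    $var += ($_ - $mean)**2 for @lengths;
    $var /= $n;
    return ($mean, sqrt($var));
}

my ($mean, $stddev) = edge_stats(3, 4, 5);
printf "mean=%.4f stddev=%.4f\n", $mean, $stddev;   # mean=4.0000 stddev=0.8165
```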

Devel::Profile tells me this:

%Time       Sec.      #calls    sec/call  F  name
58.21  1889.8621    28882029    0.000065     main::SegmentIntersection
30.60   993.5924           1  993.592408     main::find_unintersected_edges
11.12   361.0416    86646087    0.000004     main::Determinant
 0.05     1.7613           1    1.761285     main::find_unique_edges
So I'm spending a ton of time doing the math and looping through the edge lists. I'm looking for suggestions of things to speed up this algorithm, though I'm willing to entertain ideas of differing algorithms. :)

Re: Millions of line segment intersection calcs: Looking for speed tips
by rvosa (Curate) on Aug 03, 2005 at 14:26 UTC
    Are you in a position to do some of the number crunching using Inline::C?
      I just knew this would be reply #1. :)

      The answer is sorta. There is nothing technically that I'm aware of that would prevent using Inline::C. I lack knowledge of C; however, that probably isn't a good reason not to consider this. In fact, yesterday I downloaded one of the tutorials and read through the 8 chapters in an effort to learn. I did get a program built that does the line segment calculation, I just haven't yet figured out how to pass data in and out of it (using Inline::C or not).

Re: Millions of line segment intersection calcs: Looking for speed tips
by rvosa (Curate) on Aug 03, 2005 at 14:39 UTC
    In this code block:
    # p1 is starting or anchor point of the line segment
    foreach my $p1 (@{$points}) {
        # p2 is end point of the line segment
        foreach my $p2 (@{$points}) {
            # We don't need to calculate if anchor and end are the same
            # or we have already seen these two pairs [reversed] before
            unless ( ($p1 == $p2) || $found->{$p2}->{$p1} ) {
                # Compute the edge
                my $edge = sqrt( ($p1->[0] - $p2->[0])**2
                               + ($p1->[1] - $p2->[1])**2 );
                # Push the whole thing on a stack
                push (@edges, [ $edge, $p1->[0], $p1->[1], $p2->[0], $p2->[1] ]);
                # Keep track of the edges we've already computed (no need to do so twice)
                $found->{$p2}->{$p1} = 1;
            }
        }
    }
    Can you perhaps do:
    # outer loop goes through all points
    for my $i ( 0 .. $#{$points} ) {
        # inner loop only through those not visited yet
        for my $j ( ($i + 1) .. $#{$points} ) {
            # Compute the edge
            my $edge = sqrt( ($points->[$i][0] - $points->[$j][0])**2
                           + ($points->[$i][1] - $points->[$j][1])**2 );
            # Push the whole thing on a stack
            push (@edges, [ $edge,
                            $points->[$i][0], $points->[$i][1],
                            $points->[$j][0], $points->[$j][1] ] );
        }
    }
    I don't know enough about the problem space to be sure this is helpful or hurtful...
Re: Millions of line segment intersection calcs: Looking for speed tips
by itub (Priest) on Aug 03, 2005 at 14:43 UTC
    Maybe you can apply some of the ideas discussed in 352756, which dealt with finding bonds among a set of atoms in 3D space.
Re: Millions of line segment intersection calcs: Looking for speed tips
by extremely (Priest) on Aug 03, 2005 at 14:57 UTC
    You should be able to divide and conquer the data set. Rather than sorting the list by segment length, you should pick a dimension and sort on that.

    Sort all the points by X and then Y. Create all the edges so that the higher X is first. Sort all the edges by X1, then X2. When processing a segment against the list, you can skip any segment where Xn1 is less than X2 or Xn2 is greater than X1. Better still, you can exit the loop early once Xn1 is less than X2 since all the rest of the segments are certain to not intersect.

    The segment intersection calcs are expensive, adding a bail-out test for simple cases will go a long way by itself... but sorting the data to bail out of the entire list early should pay off handsomely as well.
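    That cheap bail-out test could be a bounding-box overlap check (a hedged sketch, not code from the thread; `boxes_overlap` is a made-up name): two segments cannot intersect if their x-ranges or y-ranges are disjoint, so the expensive determinant math can be skipped for them.

    ```perl
    use strict;
    use warnings;

    # Sketch of a cheap pre-filter: returns 0 when the two segments'
    # bounding boxes are disjoint (no intersection possible), 1 when
    # they overlap and the full intersection test is still needed.
    sub boxes_overlap {
        my ($ax1,$ay1,$ax2,$ay2, $bx1,$by1,$bx2,$by2) = @_;
        ($ax1,$ax2) = ($ax2,$ax1) if $ax1 > $ax2;   # normalize x-ranges
        ($bx1,$bx2) = ($bx2,$bx1) if $bx1 > $bx2;
        return 0 if $ax2 < $bx1 or $bx2 < $ax1;     # x-ranges disjoint
        ($ay1,$ay2) = ($ay2,$ay1) if $ay1 > $ay2;   # normalize y-ranges
        ($by1,$by2) = ($by2,$by1) if $by1 > $by2;
        return 0 if $ay2 < $by1 or $by2 < $ay1;     # y-ranges disjoint
        return 1;                                   # boxes overlap
    }

    print boxes_overlap(0,0, 1,1,  2,2, 3,3) ? "maybe\n" : "skip\n";   # skip
    print boxes_overlap(0,0, 2,2,  1,1, 3,3) ? "maybe\n" : "skip\n";   # maybe
    ```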

    --
    $you = new YOU;
    honk() if $you->love(perl)

Re: Millions of line segment intersection calcs: Looking for speed tips
by Eimi Metamorphoumai (Deacon) on Aug 03, 2005 at 15:04 UTC
    I suspect that there's probably a way to optimize the algorithm, but I really can't think of what it would be. However, there are numerous little changes that might help your performance a bit (though probably nothing too dramatic).
    • Why are $points and $neighborEdges hashrefs mapping integers (as strings) to values? It seems like it would be much more natural to make them simply arrayrefs (or just arrays). Then you wouldn't need neighboorCNT. So then instead of
      for (my $n=1; $n<=$neighboorCNT; $n++) {
          if ($neighborEdges->{$n}) {
      you get
      for my $n (@$neighborEdges){
      and then instead of using $neighborEdges->{$n} you'd just use $n. That would save a lot of hash accesses.
    • Very minor point, but it looks like you don't actually care about the real distance between points. Instead, you care about relative distances. So you can store the square of the distances (by not doing the sqrt) and get the same results.
    • Since Determinant is called so many times, I wonder whether you'd get any improvement over rewriting it without the temporary variables.
      sub Determinant { return ($_[0]*$_[3] - $_[2]*$_[1]); }
    • I'd probably try rethinking SegmentIntersection. Here's an unbenchmarked (and untested) version that tries to save on some of the divisions, save calls you don't need, etc. Basically, if $d is zero, we don't even need $n1 or $n2. For the inequalities, instead of dividing by $d again and again, we can multiply both sides of the inequality by $d (and flip the inequality if $d is less than zero).
      sub SegmentIntersection {
          my @points = @{$_[0]};
          my @p1 = @{$points[0]};   # p1,p2 = segment 1
          my @p2 = @{$points[1]};
          my @p3 = @{$points[2]};   # p3,p4 = segment 2
          my @p4 = @{$points[3]};
          my $d = Determinant(($p2[0]-$p1[0]),($p3[0]-$p4[0]),
                              ($p2[1]-$p1[1]),($p3[1]-$p4[1]));
          if (abs($d) < $delta) {
              return 0;   # parallel
          }
          my $n1 = Determinant(($p3[0]-$p1[0]),($p3[0]-$p4[0]),
                               ($p3[1]-$p1[1]),($p3[1]-$p4[1]));
          my $n2 = Determinant(($p2[0]-$p1[0]),($p3[0]-$p1[0]),
                               ($p2[1]-$p1[1]),($p3[1]-$p1[1]));
          if ($d > 0) {
              return $n1 < $d && $n2 < $d && $n1 > 0 && $n2 > 0;
          } else {
              return $n1 > $d && $n2 > $d && $n1 < 0 && $n2 < 0;
          }
      }
    All small things, but they might help a little.
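    The squared-distance point can be illustrated with a small sketch (not from the thread; `dist2` is a made-up helper): squared distance is monotonic in true distance, so sorting on it gives the same order, and the sqrt can wait until the final statistics pass.

    ```perl
    use strict;
    use warnings;

    # Squared distance between two [X,Y] points; sorts identically to
    # the true distance, without the sqrt per comparison.
    sub dist2 {
        my ($p, $q) = @_;
        return ($p->[0] - $q->[0])**2 + ($p->[1] - $q->[1])**2;
    }

    my @pts   = ([0,0], [3,4], [1,1]);
    my @pairs = ([0,1], [0,2], [1,2]);   # index pairs into @pts
    my @sorted = sort {
        dist2($pts[$a->[0]], $pts[$a->[1]]) <=> dist2($pts[$b->[0]], $pts[$b->[1]])
    } @pairs;
    print join(" ", map { "@$_" } @sorted), "\n";   # 0 2 1 2 0 1
    ```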
      Might also try inlining Determinant and some common sub-expression elimination...
      sub SegmentIntersection {
          my @points = @{$_[0]};
          my @p1 = @{$points[0]};   # p1,p2 = segment 1
          my @p2 = @{$points[1]};
          my @p3 = @{$points[2]};   # p3,p4 = segment 2
          my @p4 = @{$points[3]};
          my $a = $p2[0] - $p1[0];
          my $b = $p3[1] - $p4[1];
          my $c = $p2[1] - $p1[1];
          my $d = $p3[0] - $p4[0];
          my $det = $a*$b - $c*$d;
          return 0 if (abs($det) < $delta);   # parallel
          my $e = $p3[1] - $p1[1];
          my $f = $p3[0] - $p1[0];
          my $n1 = $f*$b - $e*$d;
          my $n2 = $a*$e - $c*$f;
          if ($det > 0) {
              return $n1 < $det && $n2 < $det && $n1 > 0 && $n2 > 0;
          } else {
              return $n1 > $det && $n2 > $det && $n1 < 0 && $n2 < 0;
          }
      }
      Thanks for all your suggestions. I've considered all of them and here's what I found.
      Why are $points and $neighborEdges hashrefs mapping integers (as strings) to values?
      Big picture implementation: what we are talking about is one method in a larger OO-style .pm I'm building. $points gets populated elsewhere, and it's easier to pass this around as a ref. Obviously this isn't apparent in my example. However, I really have no reason to keep $neighborEdges that way and can change it. The impact here was marginal, if any.
      but it looks like you don't actually care about the real distance between points. Instead, you care about relative distances. So you can store the square of the distances (by not doing the sqrt) and get the same results.
      I do care about the actual distances, but only for those edges I identify as neighbors. So this point is quite valid; I don't really need to sqrt until after I identify them. This isn't going to impact things much, since the calculation of the edges takes a second vs. 17-ish minutes for the intersect loop.
      Since Determinant is called so many times, I wonder whether you'd get any improvement over rewriting it without the temporary variables.
      Tested, the answer appears to be no.
      I'd probably try rethinking SegmentIntersection.
      Ahhh! I love coming here and having people see all these things that make perfect sense yet never occurred to me. Of course! This only has marginal impact, though: the test data leaves only 2240 Determinant calls out... so 1120 get dropped early.
Re: Millions of line segment intersection calcs: Looking for speed tips
by SimonClinch (Chaplain) on Aug 03, 2005 at 15:10 UTC
    The set of all possible connecting lines between two co-ordinates in such a set (irrespective of whether they are edges or not) has a cardinality of nC2. CPAN has a module for calculating all these combinations Math::Combinatorics which means that a set of 301 co-ordinates can produce an initial set of 45150 'candidate edges'.
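    That count is easy to sanity-check without any module (a trivial sketch; `n_choose_2` is a made-up name): C(n,2) = n(n-1)/2, and 301·300/2 = 45150.

    ```perl
    use strict;
    use warnings;

    # Number of unordered pairs (candidate edges) among n points
    sub n_choose_2 { my $n = shift; return $n * ($n - 1) / 2 }

    print n_choose_2(301), "\n";   # 45150
    ```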

    The bad news then is that there are millions of combinations of these connecting lines to test to see if they intersect.

    I can envisage a shortcut where at each iteration through these combinations, not only the candidate edges, but all their connection-test combinations are eliminated if an intersection occurs, leaving a rapidly diminishing set to test.

    The trick to that would be to set up a (bi-directionally) linked (by reference) list through both hashes (after using a hash for the candidate edges instead of an array) and walk down it doing your geometry test, and if an intersection is found, breaking, deleting, and relinking as you go. The reason for that is that a for loop cannot eliminate the queue ahead of itself, but a walk down a linked list can.

    One world, one people

Delaunay triangulation
by tmoertel (Chaplain) on Aug 03, 2005 at 15:14 UTC
    How does the set of "unintersected edges" computed by your algorithm differ from a Delaunay triangulation? If the difference is insignificant for your application (or can be quickly computed), you might be better off starting with a fast Delaunay triangulation (e.g., using Qhull) and adapting its output to your needs.

    Update: Take a look at Aurenhammer and Xu's "Optimal Triangulations". It points out that the greedy triangulation – what you are computing – is not necessarily the minimum-weight triangulation (MWT). The paper also notes, however, that for uniformly distributed point sets both the greedy and Delaunay triangulations yield satisfactory approximations to the MWT.

    Second update: The following code uses Qhull and computes an (approximate) DT-derived solution almost instantly:

    Example: solve dump_301.yml. Output:

    edge count = 889
    average edge length = 123.835184709118
    edge endpoints follow in YAML format
    ---
    - - 2590
      - 3183
    - - 2707
      - 3257
    ---
    - - 2560
      - 2850
    - - 2586
      - 2869
    ... other edges omitted ...

    Cheers,
    Tom

      I for one would agree you are right on both counts, but to paraphrase Shakespeare, 'Does a problem by any other name solve more sweetly?'

      Update: To clarify what I mean: the OP knows the problem and its solution but is looking for the fastest way to achieve it with perl.

      One world, one people

        To clarify what I mean: the OP knows the problem and its solution but is looking for the fastest way to achieve it with perl.

        How do you know what the OP knows and is looking for? Do you have some knowledge beyond what is in the original post? The OP says he is "willing to entertain ideas of differing algorithms." What makes you think otherwise?

        Cheers,
        Tom

        I tend to side with tmoertel. Once, in ages past, I implemented my own naive algorithm for this very thing (O(n^3) or O(n^4), probably the same way the OP is doing it, but in AutoLISP using AutoCAD), and although I knew there must be a better way, I didn't know at the time that just knowing the name Delaunay Triangulation would've helped in finding better algorithms (and it was long before Google, and probably before Al Gore invented the internet). I've since been unable to find the time to implement this in perl :-(

        There was a start made on this in perl utilizing a C library, but all I see is a readme file.

      How does the set of "unintersected edges" computed by your algorithm differ from a Delaunay triangulation?
      That's a good question. I'd never heard of Delaunay triangulation until your post, so I'm going to need to do some reading. If the answer is that it doesn't differ, or if it's even a better way to show what I'm really after (the idea of remoteness of a set of coordinates... vague enough? ;-), I may just go this route.

      I admit to being somewhat of a perl bigot; however, speed in this case is really more important than my love affair with perl. This code will be triggered by a web app submittal of a file with the coords in it. In fact, speed is probably more important than the finite accuracy of the stats I'm trying to produce. The end result will be used to relatively differentiate sets of coordinates. Or I should say, it is my hope that the people who will use these numbers see them that way. :)

      Thanks for the links and the example. I've grabbed qhull and installed it so I can play around in the next few days. I'll update when I learn some more, for the next guy that does a search.

        If you end up implementing Delaunay Triangulation, you may as well throw in Voronoi diagrams :-)
Re: Millions of line segment intersection calcs: Looking for speed tips (80% saved)
by BrowserUk (Pope) on Aug 03, 2005 at 16:18 UTC

    I think that you are checking twice as many edges as you need to.

    In your reversed edge test you are checking for {$p2}{$p1}

    unless ( ($p1 == $p2) || $found->{$p2}->{$p1} ) {

    But you are also setting the found edges with the pairings reversed.

    $found->{$p2}->{$p1} = 1;

    Which means that you are never detecting a reversed pair. Changing the setting code to

    $found->{$p1}->{$p2} = 1;

    means that when that pair comes up reversed, your test (as is) will detect them.

    That step alone cuts the number of edges in half (as you intended).

    You can save a little more there by using a hash instead of a hash ref, and concatenating the pairs so you use a single level hash, rather than a hash of zillions of little hashes. Ie.

    my %found;   ## instead of my $found = {};
    ...
    unless ( ($p1 == $p2) || $found{ "$p2|$p1" } ) {
    ...
    $found{ "$p1|$p2" } = 1;

    Less indirection and less for the garbage collector to do.

    Similar savings can be gained by using arrays rather than array references elsewhere, particularly in building, passing, and unwrapping the parameters to the SegmentIntersection() sub. You use many levels of indirection to build an array ref to an array of arrays, then pass the array ref in to the sub, where you then have to unwrap all the nesting and copy the values to local vars. Removing several layers of indirection gained several percent performance, and made the code clearer.

    Finally, a little more can be gained by avoiding copying the parameters in the Determinant() sub. Also, using integer math there (assuming that all your coordinates are less than 65536) may gain a little.
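    Combining both ideas might look like this (a hedged sketch, assuming the coordinates really are integers; `use integer` is lexically scoped and truncates all arithmetic inside the sub):

    ```perl
    use strict;
    use warnings;

    # Determinant without copying @_ into named variables, using
    # integer-only math; only valid while coordinates are integers.
    sub Determinant {
        use integer;   # affects this sub's body only
        return $_[0] * $_[3] - $_[2] * $_[1];
    }

    print Determinant(2, 3, 4, 5), "\n";   # 2*5 - 4*3 = -2
    ```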

    I am getting different results, but I believe this is because you were producing reversed duplicates.

    Assuming you agree that these results are correct, then the following code runs in around 1/5th the time of your original.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
      I think that you are checking twice as many edges as you need to.

      I think you are right. My DT-based implementation computed 889 output edges for the dump_301 data. That's exactly half of the OP's expected 1778. Perhaps the OP's output comprised two equivalent sets of edges differing only in direction.

      That step alone cuts the number of edges in half (as you intended).
      Gotta love it when you have the right idea and botch its implementation. :) Thanks, you are exactly correct. That drops things down into the 3-minute range for the example.

      My practical upper bound at the moment is 3000. I did some tests on a 2001-coordinate map (8 trillion+ calcs) with the old code, and after 56 hours it hadn't returned... doh! Let's see if it will return now.

      Next I'm going to work on the suggestions extremely made and see what I can get this down to.

Node Type: perlquestion [id://480481]
Approved by Limbic~Region
Front-paged by Limbic~Region