drewhead has asked for the
wisdom of the Perl Monks concerning the following question:
I have a set of X,Y coordinates on a 2D map. I'm trying to produce a set numbers (average, mean, stddev) of all edge lengths between neighbooring coordinates. An edge is defined as a line segemnt between two points. A neighboor is defined as an edge where no shorter edge that is defining a neighboor relationship from any possible combination of coordinates intersects it. [I think that's clear ;P]
The issue here is speed. This algorithm is O(N^3), and it doesn't take many coordinates to reach into the millions of caculations needed. For my example here, 301 coordinates, I'm finding I need 28882029 line segment intersection caculations. blech!
First I build all the unique edges. Then I loop through the edges sorted by length. Inside that I loop through all previously defined neighboor edges to determine if I have any intersection. If I don't intersect I push that edge onto the list of neigboors.
Here's a code example and a link to sample data:
example YAML dumped data: http://keep.drewhead.org/dump_301.dat
#!/usr/bin/perl
######################################################################
# CPAN Modules
######################################################################
use YAML qw(Load);
use Data::Dumper;
use Carp;
use Benchmark;
use strict;
######################################################################
# A few globals
######################################################################
my $points; # hash ref holding the raw data {'1' => [X,Y], '2' => [X,Y
+], ...}
my $edges; # Array ref to keep refrences to all edges [edge, X1, Y1, X
+2, Y2] ...
my $neighborEdges; # hash ref to hold sorted neighbor edges {'1' => ed
+ge1, '2' => edge2, ...}
my $precision = 7;
my $delta = 10 ** ($precision);
######################################################################
#
# MAIN
#
######################################################################
# Read in the X,Y coordinates from supplied data
{
local $/;
open(my $fh, "dump_301.dat") or die "Unable to open file: $!\n";
$points = Load( <$fh> );
croak($@) if $@;
close($fh);
}
my $start_pairs = Benchmark>new;
my $edges = find_unique_edges($points);
print "Unique edges ", scalar(@{$edges}), "\n";
print "Pairs parsed in :", timestr(timediff(Benchmark>new, $start_pai
+rs)), "\n";
my $start_intersects = Benchmark>new;
my $neighborEdges = find_unintersected_edges($edges);
print "Unintersected edges ", scalar(keys %{$neighborEdges}), "\n";
print "Intersects parsed in :", timestr(timediff(Benchmark>new, $star
+t_intersects)), "\n";
######################################################################
######################################################################
sub find_unique_edges
{
my $MAP = shift;
my @edges; # returnable data structure
my $start_distance = Benchmark>new;
my $found = {};
# p1 is starting or anchor point of the line segment
foreach my $p1 (@{$points}) {
# p2 is end point of the line segment
foreach my $p2 (@{$points}) {
# We don't need to caculate if anchor and end are the same
# or we have already seen these two pairs [reversed] before
unless ( ($p1 == $p2)  $found>{$p2}>{$p1} ) {
# Compute the edge
my $edge = sqrt(
($p1>[0]  $p2>[0])**2 +
($p1>[1]  $p2>[1])**2
);
# Push the whole thing on a stack
push (@edges, [ $edge, $p1>[0], $p1>[1], $p2>[0], $p2>[1]
+]);
# Keep track of the edges we've already computed (no need to d
+o so twice)
$found>{$p2}>{$p1} = 1;
}
}
}
return(\@edges);
}
######################################################################
######################################################################
sub find_unintersected_edges
{
my $PAIRS = shift;
my $neighborEdges; # returnable data structure
my $neighboorCNT = 1;
foreach my $aref ( sort {$a>[0] <=> $b>[0] } @{$PAIRS}) {
my $neighbor = 1;
my $line_1_ref = [];
push (@{$line_1_ref}, [$aref>[1],$aref>[2]]); # line 1 point A X
+,Y
push (@{$line_1_ref}, [$aref>[3],$aref>[4]]); # line 1 point B X
+,Y
for (my $n=1;$n<=$neighboorCNT;$n++) {
if ($neighborEdges>{$n}) {
my $points_ref;
push (@{$points_ref}, @{$line_1_ref}); # line 1 points
push (@{$points_ref}, [$neighborEdges>{$n}>[1],$neighborEdge
+s>{$n}>[2]]); # line 2 point A X,Y
push (@{$points_ref}, [$neighborEdges>{$n}>[3],$neighborEdge
+s>{$n}>[4]]); # line 2 point B X,Y
# If a intersect is found, set false and quit checking
if (SegmentIntersection($points_ref)) {
$neighbor = 0;
last;
}
}
}
if ($neighbor) {
$neighborEdges>{$neighboorCNT} = $aref;
$neighboorCNT++;
}
}
return($neighborEdges);
}
######################################################################
+##########
#
# Shamlessly stolen from Math::Geometry::Planar
sub SegmentIntersection {
my $pointsref = $_[0];
my @points = @$pointsref;
my @p1 = @{$points[0]}; # p1,p2 = segment 1
my @p2 = @{$points[1]};
my @p3 = @{$points[2]}; # p3,p4 = segment 2
my @p4 = @{$points[3]};
my $n1 = Determinant(($p3[0]$p1[0]),($p3[0]$p4[0]),($p3[1]$p1[1])
+,($p3[1]$p4[1]));
my $n2 = Determinant(($p2[0]$p1[0]),($p3[0]$p1[0]),($p2[1]$p1[1])
+,($p3[1]$p1[1]));
my $d = Determinant(($p2[0]$p1[0]),($p3[0]$p4[0]),($p2[1]$p1[1])
+,($p3[1]$p4[1]));
if (abs($d) < $delta) {
return 0; # parallel
}
if (!(($n1/$d < 1) && ($n2/$d < 1) &&
($n1/$d > 0) && ($n2/$d > 0))) {
return 0;
}
return(1);
}
######################################################################
+##########
#
# The determinant for the matrix  x1 y1 
#  x2 y2 
#
# args : x1,y1,x2,y2
#
sub Determinant {
my ($x1,$y1,$x2,$y2) = @_;
return ($x1*$y2  $x2*$y1);
}
The correct answer is Unintersected edges 1778.
In this code I'm only trying to find the unique edges. Once I have that the statistics I need to caculate are trivial.
Also I took the code from Math::Geometry::Planar and tweaked the SegmentIntersection routine to return a boolean. As written it returned the intersection point which was more than I actually needed.
Devel::Profile tell me this:
%Time Sec. #calls sec/call F name
58.21 1889.8621 28882029 0.000065 main::SegmentIntersection
30.60 993.5924 1 993.592408 main::find_unintersected_edge
+s
11.12 361.0416 86646087 0.000004 main::Determinant
0.05 1.7613 1 1.761285 main::find_unique_edges
So I'm spending a ton of time doing the math and looping through the edge lists.
I'm looking for suggestions of things to speed up this algorithm, though I'm willing to entertain ideas of differing algorithms. :)
Re: Millions of line segment intersection calcs: Looking for speed tips by rvosa (Curate) on Aug 03, 2005 at 14:26 UTC 
Are you in a position to do some of the numbercrunching using Inline::C?  [reply] 

I just knew this would be reply #1. :)
The answer is sorta. There is nothing techinically that I'm aware of that would prevent using Inline::C. I lack knowledge of C, however that probally isn't really a good reason not to consider this. Yesterday infact I downloaded one of the tutorials and read through the 8 chapters in an effort to learn. I did get a program built that did the line segment caculation, I just havn't yet figured out how to pass data in and out of this yet (using Inline::C or not).
 [reply] 
Re: Millions of line segment intersection calcs: Looking for speed tips by rvosa (Curate) on Aug 03, 2005 at 14:39 UTC 
# p1 is starting or anchor point of the line segment
foreach my $p1 (@{$points}) {
# p2 is end point of the line segment
foreach my $p2 (@{$points}) {
# We don't need to caculate if anchor and end are the same
# or we have already seen these two pairs [reversed] before
unless ( ($p1 == $p2)  $found>{$p2}>{$p1} ) {
# Compute the edge
my $edge = sqrt(
($p1>[0]  $p2>[0])**2 +
($p1>[1]  $p2>[1])**2
);
# Push the whole thing on a stack
push (@edges, [ $edge, $p1>[0], $p1>[1], $p2>[0], $p2>[1]
+]);
# Keep track of the edges we've already computed (no need to d
+o so twice)
$found>{$p2}>{$p1} = 1;
}
}
}
Can you perhaps do:
# outer loop goes through all points
for my $i ( 0 .. $#{$points}) {
# inner loop only through those not visited yet
for my $j ( ( $i + 1 ) .. $#{$points}) {
# Compute the edge
my $edge = sqrt(
( $points>[$i][0]  $points>[$j][0] )**2 +
( $points>[$i][1]  $points>[$j][1] )**2
);
# Push the whole thing on a stack
push (@edges, [
$edge,
$points>[$i][0],
$points>[$i][1],
$points>[$j][0],
$points>[$j][1]
]
);
}
}
}
I don't know enough about the problem space to be sure this is helpful or hurtful...  [reply] [d/l] [select] 
Re: Millions of line segment intersection calcs: Looking for speed tips by itub (Priest) on Aug 03, 2005 at 14:43 UTC 
Maybe you can apply some of the ideas discussed in 352756, which dealt with finding bonds among a set of atoms in 3D space.  [reply] 
Re: Millions of line segment intersection calcs: Looking for speed tips by extremely (Priest) on Aug 03, 2005 at 14:57 UTC 
You should be able to divide and conquer the data set. Rather than sorting the list by segment length, you should pick a dimension and sort on that.
Sort all the points by X and then Y. Create all the edges so that the higher X is first. Sort all the edges by X1, then X2. When processing a segment against the list, you can skip any segment where Xn1 is less than X2 or Xn2 is greater than X1. Better still, you can exit the loop early once Xn1 is less than X2 since all the rest of the segments are certain to not intersect.
The segment intersection calcs are expensive, adding a bailout test for simple cases will go a long way by itself... but sorting the data to bail out of the entire list early should pay off handsomely as well.

$you = new YOU;
honk() if $you>love(perl)
 [reply] 
Re: Millions of line segment intersection calcs: Looking for speed tips by Eimi Metamorphoumai (Deacon) on Aug 03, 2005 at 15:04 UTC 
I suspect that there's probably a way to optimize the algorithm, but I really can't think of what it would be. However, there are numerous little changes that might help your performance a bit (though probably nothing too dramatic.
 Why are $points and $neighborEdges hashrefs mapping integers (as strings) to values? It seems like it would be much more natural to make them simply arrayrefs (or just arrays). Then you wouldn't need neighboorCNT. So then instead of
for (my $n=1;$n<=$neighboorCNT;$n++) {
if ($neighborEdges>{$n}) {
you get
for my $n (@$neighborEdges){
and when instead of using $neighborEdges>{$n} you'd just use $n. That would save a lot of hash accesses.
 Very minor point, but it looks like you don't actually care about the real distance between points. Instead, you care about relative distances. So you can store the square of the distances (by not doing the sqrt) and get the same results.
 Since Determinant is called so many times, I wonder whether you'd get any improvement over rewriting it without the temporary variables.
sub Determinant {
return ($_[0]*$_[3]  $_[2]*$_[1]);
}
 I'd probably try rethinking SegmentIntersection. Here's an unbenchmarked (and untested) version that tries to save on some of the divisions, save calls you don't need, etc. Basically, if $d is zero, we don't even need $n1 or $n2. For the inequalities, instead of dividing by $d again and again, we can multiply both sides of the inequality by $d (and flip the inequality if $d is less than zero).
sub SegmentIntersection {
my @points = @{$_[0]};
my @p1 = @{$points[0]}; # p1,p2 = segment 1
my @p2 = @{$points[1]};
my @p3 = @{$points[2]}; # p3,p4 = segment 2
my @p4 = @{$points[3]};
my $d = Determinant(($p2[0]$p1[0]),($p3[0]$p4[0]),
($p2[1]$p1[1]),($p3[1]$p4[1]));
if (abs($d) < $delta) {
return 0; # parallel
}
my $n1 = Determinant(($p3[0]$p1[0]),($p3[0]$p4[0]),
($p3[1]$p1[1]),($p3[1]$p4[1]));
my $n2 = Determinant(($p2[0]$p1[0]),($p3[0]$p1[0]),
($p2[1]$p1[1]),($p3[1]$p1[1]));
if ($d > 0){
return $n1 < $d && $n2 < $d && $n1 > 0 && $n2 > 0;
} else {
return $n1 > $d && $n2 > $d && $n1 < 0 && $n2 < 0;
}
}
All small things, but they might help a little.  [reply] [d/l] [select] 

Might also try inlining Determinant and some common subexpression elimination...
sub SegmentIntersection {
my @points = @{$_[0]};
my @p1 = @{$points[0]}; # p1,p2 = segment 1
my @p2 = @{$points[1]};
my @p3 = @{$points[2]}; # p3,p4 = segment 2
my @p4 = @{$points[3]};
my $a = $p2[0]  $p1[0];
my $b = $p3[1]  $p4[1];
my $c = $p2[1]  $p1[1];
my $d = $p3[0]  $p4[0];
my $det = $a*$b  $c*$d;
return 0 if (abs($det) < $delta); # parallel
my $e = $p3[1]$p1[1];
my $f = $p3[0]$p1[0];
my $n1 = $f*$b  $e*$d
my $n2 = $a*$e  $c*$f;
if ($det > 0)
{
return $n1 < $det && $n2 < $det && $n1 > 0 && $n2 > 0;
}
else
{
return $n1 > $det && $n2 > $det && $n1 < 0 && $n2 < 0;
}
}
 [reply] [d/l] 

Thanks for all your suggestions. I've considered all of them and here's what I found.
Why are $points and $neighborEdges hashrefs mapping integers (as strings) to values?
Big picture implementation: what we are talking about is one method in a larger OO style pm I'm building. $points gets populated elsewhere and it's eaiser to pass this around as a ref. Obvioulsy this isn't apparent in my example. However I really have no reason to keep neighborEdges that way and can change it. Impact here was marginal if any.
but it looks like you don't actually care about the real distance between points. Instead, you care about relative distances. So you can store the square of the distances (by not doing the sqrt) and get the same results.
I do care about the actual distances, but only for those edges I identify as neighboors. So this point is quite valid, I don't really need to sqrt until after I identify these. This isn't going to impact things sine the caculation of the edges takes a second vs 17ish minutes for the intersect loop.
Since Determinant is called so many times, I wonder whether you'd get any improvement over rewriting it without the temporary variables.
Tested, the answer appears to be no.
I'd probably try rethinking SegmentIntersection.
Ahhh! I love coming here and having people see all these things that make perfect send yet never occured to me. Ofcourse! This only has marginal impact. The test data leaves only 2240 Determinant calls out... so 1120 get dropped early.
 [reply] 
Re: Millions of line segment intersection calcs: Looking for speed tips by SimonClinch (Chaplain) on Aug 03, 2005 at 15:10 UTC 
The set of all possible connecting lines between two coordinates in such a set (irrespective of whether they are edges or not) has a cardinality of nC2. CPAN has a module for calculating all these combinations Math::Combinatorics which means that a set of 301 coordinates can produce an initial set of 45150 'candidate edges'.
The bad news then is that there are millions of combinations of these connecting lines to test to see if they intersect.
I can envisage a shortcut where at each iteration through these combinations, not only the candidate edges, but all their connectiontest combinations are eliminated if an intersection occurs, leaving a rapidly diminishing set to test. The trick to that would be to set up a (bidirectionally) linked (by reference) list through both hashes (after using a hash for the candidate edges instead of an array) and walk down it doing your geometry test and if connection found, breaking deleting and relinking as you go. The reason for that is that a for loop cannot eliminate the queue ahead of itself, but a walk down a linked list can.
 [reply] 
Delaunay triangulation by tmoertel (Chaplain) on Aug 03, 2005 at 15:14 UTC 
How does the set of "unintersected edges" computed by your algorithm differ from a Delaunay triangulation? If the difference is insignificant for your application (or can be quickly computed), you might be better off starting with a fast Delaunay triangulation (e.g., using Qhull) and adapting its output to your needs.
Update: Take a look at Aurenhammer and Xu's "Optimal Triangulations". It points out that the greedy triangulation – what you are computing – is not necessarily the minimumweight triangulation (MWT). The paper also notes, however, that for uniformly distributed point sets both the greedy and Delaunay triangulations yield satisfactory approximations to the MWT.
Second update: The following code uses Qhull and computes an (approximate) DTderived solution almost instantly:
Example: solve dump_301.yml. Output:
edge count = 889
average edge length = 123.835184709118
edge endpoints follow in YAML format


 2590
 3183

 2707
 3257


 2560
 2850

 2586
 2869
... other edges omitted ...
Cheers, Tom
 [reply] [d/l] [select] 

 [reply] 

Verily, the problem doth solve more sweetly. For by a common name it is well studied, and we may use the sweet fruits thereof in our pursuits.
 [reply] 

 [reply] 

 [reply] 

How does the set of "unintersected edges" computed by your algorithm differ from a Delaunay triangulation?
That's a good question. I'd never heard of Delaunay triangulation until your post, so I'm going to need to do some reading. If the answer is that it doesn't or maybe if it's even a better way to show what I'm really after (the idea of remoteness of a set of coordinates... vague enough? ;) I may just go this route.
I admit to being somewhat of a perl biggot, however; speed in this case is really more important than my love affair with perl. This code will be triggered by a web app submittal of a file with the coords in it. Infact, speed is probally more important than finite accuracy of the stats I'm trying to produce. The end result will be used to relativally diffrenciate sets of coordinates. Or I should say it is my hope that the people who will use these number see them that way. :)
Thanks for the links and the example. I've grabbed qhull and installed it so I can play around in the next few days. I'll update when I learn some more so the next guy that does a search.
 [reply] 

If you end up implementing Delaunay Triangulation, you may as well throw in Voronoi diagrams :)
 [reply] 
Re: Millions of line segment intersection calcs: Looking for speed tips (80% saved) by BrowserUk (Pope) on Aug 03, 2005 at 16:18 UTC 
I think that you are checking twice as many edges as you need to.
In your reversed edge test you are checking for {$p2}{$p1}
unless ( ($p1 == $p2)  $found>{$p2}>{$p1} ) {
But you are also setting the found edges with the pairings reversed.
$found>{$p2}>{$p1} = 1;
Which means that you are never detecting a reversed pair. Changing the setting code to
$found>{$p1}>{$p2} = 1;
means that when that pair comes up reversed, your test (as is) will detect them.
That step alone cuts the number of edges in half (as you intended).
You can save a little more there by using a hash instead of a hash ref, and concatenating the pairs so you use a single level hash, rather than a hash of zillions of little hashes. Ie.
my %found; ## instead of my $found = {};
...
unless ( ($p1 == $p2)  $found{ "$p2$p1" } ) {
...
$found>{ "$p1$p2" } = 1;
Less indirection and less for the garbage collector to do.
Similar saving can be gained by using arrays rather than array references elsewhere, particularly in the building and passing and unwrapping of the paramaters to the SegmentIntersection() sub. You use many levels of indirection build an array ref to an array of arrays, then pass the array ref in to the sub where you have then to unwrap all the nesting and copy them to local vars. Removing several layers of indirection gained several percent performanceand made the code clearer.
Finally, little more can be gained by avoiding copying the parameters in the Determinant() sub. Also using integer math there, assuming that all your coordinates are less than 65536, this may gain a little.
<I am getting different results, but I beleive this is because you were producing reversed duplicates.
Assuming you agree that these results are correct, then the the following code runs in around 1/5th the time of your original.
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.  Rule 1 has a caveat!  Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
 [reply] [d/l] [select] 

 [reply] 

That step alone cuts the number of edges in half (as you intended).
Gotta love it when you have the right idea and botch it's implementation. :) Thanks, you are exactly correct. That drops things down into the 3 minute range for the example.
My praticle upper bounds at the moment is 3000. I did some test on a 2001 (8 trillion + calcs) coord map with the old code and after 56 hours I hand't returned ... doh! Lets see if this will return now.
Next I'm going to work on the suggestions extremely made and see what I can get this down to.
 [reply] 

