vroom has asked for the wisdom of the Perl Monks concerning the following question:
I have an application which tracks complaints based on location. The X and Y for the problem location is then stored. However many complaints could be placed for one problem. Before a work order is made it would be nice to be able to find all items that fall within a similar geographical area and then associate all of those complaints with the newly created work order. The goal would be to find clumps of problems and list them together. It's straight forward to find all the items that fall within X feet of a given location. Doing this for all the possible combinations within the dataset is less desirable.
The goal would be to find a decent solution that isn't
too computationally intensive to identify and list these
clumps of complaints.
Some thoughts I had:
 Sort the list by x and sort the list by y. Then come up with a method for merging the lists w/o listing a given complaint more than once. Downside: Identifying a method for merging the results that doesn't result in too many misses
 Round the X and Y to the nearest N feet then sort by X concatenated with Y. This finds all items which fall within a given square on the grid. Downside: This fails to notice items which are just over the border from each other.
Has anyone had any experience with a similar problem or have an approach that they think would be useful?
Thanks,
Tim
Re: Sorting by geographical proximity / clumping groups of items based on X and Y
by DamnDirtyApe (Curate) on Jul 17, 2002 at 23:22 UTC

#! /usr/bin/perl
use strict ;
use warnings ;
my @complaints = (
{
'complaint_id' => 'a123',
'x' => '45.2',
'y' => '39.7'
},
{
'complaint_id' => 'b456',
'x' => '79.3',
'y' => '42.0'
},
{
'complaint_id' => 'c789',
'x' => '11.9',
'y' => '29.8'
},
{
'complaint_id' => 'd863',
'x' => '95.3',
'y' => '17.2'
},
{
'complaint_id' => 'e635',
'x' => '65.5',
'y' => '33.3'
},
) ;
my ( $curr_x, $curr_y ) = ( 36.2, 47.3 ) ;
my @sorted_by_proximity = map { $_>[1]>{'dist'} = $_>[0]; $_>[1] }
sort { $a>[0] <=> $b>[0] }
map { [ distance( $curr_x, $curr_y,
$_>{'x'}, $_>{'y'}
), $_
]
} @complaints ;
foreach ( @sorted_by_proximity )
{
print $_>{'complaint_id'} . " is " . $_>{'dist'} . " away.\n" ;
}
sub distance
{
my ($x1, $y1, $x2, $y2 ) = @_ ;
return sqrt( abs( $x2  $x1 )**2 + abs( $y2  $y1 )**2 ) ;
}
This gives you an array ref of complaints, sorted closestfirst. That way, you could take the first n complaints, or all the complaints until the distance is greater than x, etc.
Update: Thanks to dws and hossman both correct.
Here's mv revised code which may be a little more useful.
#! /usr/bin/perl
use strict ;
use warnings ;
# Define some rules and constants.
my $CLUSTER_RADIUS = 10 ; # Max distance to be included in cluster.
my $CENTER_SKIP = 5 ; # Distance between centers on cluster scans
+.
my $CLUSTER_COUNT = 4 ; # Minimum no. of complaints to make a clust
+er.
my $START_X = 0 ;
my $END_X = 100 ;
my $START_Y = 0 ;
my $END_Y = 100 ;
# Load the complaints.
my @complaints = () ;
while ( <DATA> )
{
next if /^$/ ;
my $hashref = {} ;
( $hashref>{'id'}, $hashref>{'x'}, $hashref>{'y'} ) = split ;
push @complaints, $hashref ;
}
for ( my $curr_x = $START_X ;
$curr_x <= $END_X ;
$curr_x += $CENTER_SKIP )
{
for ( my $curr_y = $START_Y ;
$curr_y <= $END_Y ;
$curr_y += $CENTER_SKIP )
{
my @near = map { $_>[1]>{'dist'} = $_>[0] ; $_>[1] }
grep { $_>[0] <= $CLUSTER_RADIUS }
map { [ distance( $curr_x, $curr_y,
$_>{'x'}, $_>{'y'}
), $_
]
} @complaints ;
my $complaint_count = @near ;
if ( $complaint_count >= $CLUSTER_COUNT )
{
printf "%d complaints near (%d,%d).\n",
$complaint_count, $curr_x, $curr_y ;
}
}
}
exit 0 ;
#
sub distance
{
my ($x1, $y1, $x2, $y2 ) = @_ ;
return sqrt( ( $x2  $x1 )**2 + ( $y2  $y1 )**2 ) ;
}
__DATA__
a123 10.1 12.5
b124 12.5 8.6
c125 9.8 10.2
d213 24.6 19.5
e753 35.6 2.2
f854 76.5 46.7
g354 8.7 8.6
_______________
D
a
m
n
D
i
r
t
y
A
p
e
Home Node

Email
 [reply] [d/l] [select] 

return sqrt( abs( $x2  $x1 )**2 + abs( $y2  $y1 )**2 ) ;
The abs() isn't necessary, since the result of squaring the difference will always be positive.
For purposes of ordering pairs by distance, it is sufficient to sum the squared differences and sort on distance^2, deferring the sqrt() until a true distance is necessary. (I picked up this trick from Jon Bentley's "Writing Efficient Programs," which is out of print, but worth grabbing if you find a copy in a used bookstore.)
You can also exclude from consideration, before sorting, any pair that has a difference in either dimension that is > X. Such pairs will be excluded anyway, so why clog up the sort with them.
 [reply] [d/l] 

that assumes a $curr_x and a $curr_y ... my understanding of the question
was to find clusters of complaints  not just the problems near a given location.
 [reply] 
Re: Sorting by geographical proximity / clumping groups of items based on X and Y
by dws (Chancellor) on Jul 17, 2002 at 23:51 UTC

One simple approach to geographic searching is to precalculate a "bucket" for each coordinate, and then organize the data such that the bucket number can be used as a key to quickly lookup all coordinates that fall within that bucket bucket. This quickly culls down the search space, allowing you to make the pairwise distance calculation on a subset of the data.
The initial filtering to find all points within X feet of a given point by start by considering how many additional buckets along one vector (say, along the X axis) need to be considered such that the far side of one bucket to the far side of the end bucket is >= X feet. This number forms an integral radius for a sweep around the bucket space (or, more simply, to determine the square grid of buckets to be searched).
If you bucket sizes selected such that most searches can be done by considering a bucket and its immediate neighbors, the SQL query is pretty simple. Prepare the query
select * from complaint where bucket in (?,?,?,?,?,?,?,?,?)
and then plug in the bucket ids when you execute.
If you need to consider more buckets, write the bucket numbers into a temporary table, using that table to JOIN against the complaint table.
Then perform the distance calculation on whatever comes back, and sort accordingly.
 [reply] [d/l] 
Re: Sorting by geographical proximity / clumping groups of items based on X and Y
by hossman (Prior) on Jul 18, 2002 at 00:55 UTC

Questions lke this tend to require some defintion
of "proximity" which is rarely as simple as "N points within
a polygon or area M" ... typically people want to find
trends  3 problems within 1 grid square is a trend,
1 is not  BUT! 1 per grid square for 10x10 grid squares
is.
An algorithm that seems like it would work well,
assumes that you have a range of "region sizes" that can
overlap, and difffernet thresolds for the number of problems
that constitute a "hot spot region". In psuedo code...
%grid_size_thresholds = (
1 => 3, # a 1x1 square grid is hot if it has 3 probs
2 => 10,# a 2x2 square grid is hot if it has 10 probs
3 => 30,# ...
...
)
@hotspots = ();
foreach $size (sort keys %grid_size_thresholds) {
for ($x = 0; $x + $size < $GRID_WIDTH; $x++) {
for ($y = 0; $y + $size < $GRID_HEIGHT; $y++) {
$region = new Region($x, $x + $size,
$y, $y + $size);
$prob_count =
get_num_probs_in_region($region);
if ($grid_size_thresholds{$size} <= $prob_count) {
push @hotspots, $region;
}
}
}
}
You now have a list of (possibly nested) square regions
of various sizes which identify hotspots. You can
eliminate regions which are completely contained by
other regions in the list, or you could use the info to
find "hot spots within warm spots"
(ie: "region 1,4;2,3 has had enough problems to deserve attention,
and within that 1,2;2,3 desrves special attention)
 [reply] [d/l] 
Re: Sorting by geographical proximity / clumping groups of items based on X and Y
by jepri (Parson) on Jul 17, 2002 at 23:35 UTC

You could try looking at the density of complaints in an area, even though that could be computationally intensive. It would give you some idea of where the 'clumps' are.
To find the rough edges of the clumps, find the number of complaints in a small radius, then increase the radius and see how many more you get. At some point the program will have to make the call as to whether or not it is worth expanding the circle to include the extra few points, which is a classical calculus problem.
You would also have to include extra rules to stop the circle from increasing forever or shrinking to nothing.
Other techniques might involve randomly (or less than randomly) drawing circles around the centre of the clump and seeing which ones catch the most complaints, then declaring the union of all the circles to be the clump.
____________________
Jeremy
I didn't believe in evil until I dated it.  [reply] 

This solution is similar to one that I was thinking of. I don't have code but the algorithm is like this:
 Pick a point where there's a complaint.
 Draw a circle of some arbitrary radius around it (a little experimentation will find the optimum radius, I'm sure).
 If another point falls within the radius, increase the area of the circle by a fixed amount and recenter it so it lies at the midpoint between the two points.
 If another point falls inside the circle, increase the area of the circle again and recenter it so the center lies as close to the center of the cluster as possible.
 Keep increasing the area and recentering as long as more points fall inside the circle.
Since you're increasing area instead of radius, the circle will stretch less and less each time until it stops growing. If the circle stops growing, record the center of the cluster and all the complaints inside it. Once you have a cluster, move on to a point that isn't part of a cluster yet and go again. If some point lies in two different clusters, make it part of the closer cluster.
 [reply] 
Re: Sorting by geographical proximity / clumping groups of items based on X and Y
by AbigailII (Bishop) on Jul 18, 2002 at 11:58 UTC

I'd say that the problem is a bit underspecified. Two trivial
solution spring into mind, both more or less fitting the
specification.
 Put all N complaints in one clump, and make one workorder.
 Make N clumps, putting one complaint in each clump, then
make a workorder for each clump.
Obviously, neither extreme is what you want, but what do you
want? Are there any limits on the number of clumps, the number
of complaints per clumb, the maximum distance between any pair
of complaints per clumb? Do you have an expressions in terms
of complaints, clumbs, distances that you want to minimize?
Even if you say something like that you want sqrt(N) of clumps,
there will be solutions that you probably do not wish  you could
make a scanline parallel to the Yaxis and make a sweep. Each
time your scanline hits a complaint, you increment a counter.
If the counter hits k * sqrt(N), the last sqrt(N) encountered
complaints are put together into a clump.
The hardest part is to come up with the specifictation: what
do you want to minimize/maximize/minmax/maxmin? That will
determine whether this problem is trivial to solve, or very
hard. I've already shown trivial solutions, but it isn't hard
to come up with specification that will quickly reduce problems
like the travelling salesman, or 01 knapsack to an instance
of the described problem.
Abigail  [reply] 

I would say the most likely definition would be that all items within a clump are at most X feet from any other item in the clump. X would probably have a default of 300 feet or so but could be userdefinable.
 [reply] 

Despite what I just said about NPcompleteness, the following algorithm
might give a reasonable solution (certainly not optimal).
 Let C be the set of all complaints.
 For each complaint c in C find the set S(c) of all complaints d in C
such that the distance between c and d is less than X (X is user defined).
 Find complaint e in C such that S(e) <= S(f) for all f in C;
that is, find the complaint who has the most other complaints nearby.
Pick a random one in case of a tie.
 Make a clump out of e and S(e).
 Remove all complaints g in S(e) from C. For all h remaining in C,
remove from S(h) all g in S(e).
 If C is empty, we're done. Else, goto 3.
Some pseudo code:
# Get set of all complaints.
my @C = get_all_complaints;
# Find all the associated sets.
my %D = map {my $c = $_;
$c => {map {$_ => 1}
grep {$c ne $_ && distance ($c, $_) < $X} @C}} @C;
while (%D) {
# Find complaint with the most nearby.
my ($complaint, $size) = (undef, 0);
while (my ($c, $set) = each %D) {
($complaint = $c, $size = keys %$set) if keys %$set > $size;
}
# Found largest, make a clump.
make_clump ($complaint, @{$D {$complaint}});
# Delete largest from set.
my $set = delete $D {$complaint};
# Delete associated set from set.
delete @D {keys %$set};
# Delete associated set from associated sets.
delete @{$_} {keys %$set} for values %D;
}
The performance will be quadratic, I'm afraid.
Abigail
 [reply] [d/l] 


But you need more than that, otherwise the "Put each complaint
in its own clump" is still a solution. It's tempting to add
the restriction that you want to minimize the number of clumps,
but that smells very much to an NPcomplete covering problem.
Abigail
 [reply] 
Use Inverse Square Law
by Notromda (Pilgrim) on Jul 18, 2002 at 15:28 UTC

Here's an off the wall idea, though I'm a long ways from trying to code it. Anyway, I'll start babbling and let someone else refine this, if it is worth it.
Think of it visually, like a dark room, and each complaint is a a candle. The radience from each candle dwindles as per inverse square law. If two candles are near each other, then they would have a bright area between them. All you need to do is find the brightest place and go there.
How does this work in code? I'm not sure, but a matrix might be usefull. For every complaint, increment the value of nearby geographical points inversely proportional to the distance from the complaint.
When all complaints are plotted, look for the highest value in the matrix, send a tech there.
Remove up to X complaints within a radius of Y, and recompute. repeat as necessary.
The granularity of the matrix could be adjusted as needed. The "luminosity" of complaints will also need tweaking.
Update: here's some code, but it doesn't quite do what I want. points on the grid next to a complaint are too "hot". Maybe it should be a more linear dropoff?
my @complaints = (
{
'complaint_id' => 'a123',
'x' => '45.2',
'y' => '39.7'
},
{
'complaint_id' => 'b456',
'x' => '46.5',
'y' => '41.2'
},
{
'complaint_id' => 'c789',
'x' => '11.9',
'y' => '29.8'
},
{
'complaint_id' => 'd863',
'x' => '95.3',
'y' => '17.2'
},
{
'complaint_id' => 'e635',
'x' => '65.5',
'y' => '33.3'
},
) ;
my @highlight = (0.0,0.0);
my $highvalue = 0.0;
foreach $testpoint (@complaints) {
for ($i = 0.0; $i < 100.0; $i += 1.0) {
for ($j = 0.0; $j < 100.0; $j+= 1.0) {
$dsqrd = ((($testpoint>{'x'}  $i)**2 + ($testpoint>{'y'
+}  $j)**2));
$points[$i][$j] += $dsqrd ? (1000 / ($dsqrd * 4 * 3.14)) :
+ 0 ;
if ($points[$i][$j] > $highvalue) {
@highlight = ($i,$j);
$highvalue = $points[$i][$j];
}
}
}
}
print "\n\n $highlight[0], $highlight[1] has value $highvalue\n";
 [reply] [d/l] 

I've played with it a little more  I made the luminosity taper off as 1/d instead of 1/4*pi*d**2. Also, I added a loop to remove visited nodes.
Even when I reinitialize the points array to 0, it still says to run off and fix one job before going to a job with three complaints. I think this has to do with a compliant coordinate that is really close to a grid point as oppsed to others that are not so close to a line. I'm testing it with a more granular grid. First thing I realized is a 1000 by 1000 grid takes a lot more time to compute. :P
After trying that experiment, I learned that it plucked off points one at a time... not what I was expecting.
I'm almost sure there's a better equation for calculating this... :)
Here's the code that provides an interesting result for the given test set:
my $service_radius=15;
my $numpoints=8;
my @complaints = (
{
'complaint_id' => 'a123',
'x' => '45.2',
'y' => '39.7'
},
{
'complaint_id' => 'b456',
'x' => '46.5',
'y' => '41.2'
},
{
'complaint_id' => 'c789',
'x' => '47.5',
'y' => '38.8'
},
{
'complaint_id' => 'd863',
'x' => '95.3',
'y' => '17.2'
},
{
'complaint_id' => 'e635',
'x' => '10.5',
'y' => '89.3'
},
{
'complaint_id' => 'f635',
'x' => '12.5',
'y' => '1.3'
},
{
'complaint_id' => 'g635',
'x' => '14.5',
'y' => '15.3'
},
{
'complaint_id' => 'h635',
'x' => '4.5',
'y' => '20.3'
},
) ;
while ($numpoints > 0) {
my @highlight = (0.0,0.0);
my $highvalue = 0.0;
my @points;
for ($i = 0.0; $i < 100.0; $i += 1.0) {
for ($j = 0.0; $j < 100.0; $j+= 1.0) {
$points[$i][$j] = 0 ;
}
}
foreach $testpoint (@complaints) {
next if not defined $testpoint;
print "Plotting " . $testpoint>{'complaint_id'} . "\n";
for ($i = 0.0; $i < 100.0; $i += 1.0) {
for ($j = 0.0; $j < 100.0; $j+= 1.0) {
$dsqrd = ((($testpoint>{'x'}  $i)**2 + ($testpoint>{'y'
+}  $j)**2));
$points[$i][$j] += $dsqrd ? (1 / sqrt ($dsqrd)) : 0 ;
if ($points[$i][$j] > $highvalue) {
@highlight = ($i,$j);
$highvalue = $points[$i][$j];
}
}
}
}
print "$highlight[0], $highlight[1] has value $highvalue\n";
foreach $testpoint (@complaints) {
next if not defined $testpoint;
if (sqrt (($testpoint>{'x'}  $highlight[0])**2 + ($testpoint>{'
+y'}  $highlight[1])**2) < $service_radius) {
print "removing " . $testpoint>{'complaint_id'} . " coordinat
+es $testpoint>{'x'} , $testpoint>{'y'}\n";
$testpoint = undef;
$numpoints;
}
}
print "\n\n";
} #end while
 [reply] [d/l] 
Re: Sorting by geographical proximity / clumping groups of items based on X and Y
by bedk (Pilgrim) on Jul 18, 2002 at 17:40 UTC

You're describing a clustering problem. If you know how many clumps you want to end up with, you
can run a kmeans
clustering algorithm, which is just a mathematical formulation of Notromda's suggestion.
Because you have a desired maximum distance, you can choose the number of clusters (K) by starting with K=1,
and running kmeans clustering. If the distance between elements in any cluster is too big, increment K and recluster.
Iterate until until all clusters meet the maximum distance criterion.
Brian
 [reply] 
Re: Sorting by geographical proximity / clumping groups of items based on X and Y
by talexb (Chancellor) on Jul 18, 2002 at 18:16 UTC

Again, a fascinating thread. And amazingly I actually studied this kind of problem in university 20 years ago (shudder) under a brilliant professor called Ed Jernigan (University of Waterloo, faculty of Engineering, department of Systems Design).
So for scholarly research I can suggest Google.
AbigailII's point is quite valid  there need to be constraints on the solution otherwise optimization cannot be done. You need to do something like
 set a maximum cluster radius and
 set a minimum and/or a maxmimum number of complaints within each cluster.
t. alex
"Mud, mud, glorious mud. Nothing quite like it for cooling the blood!"
Michael Flanders and Donald Swann
 [reply] 
Re: Sorting by geographical proximity / clumping groups of items based on X and Y
by FoxtrotUniform (Prior) on Jul 18, 2002 at 16:16 UTC

One offthecuff idea: treat this as an imageprocessing
problem.
 Segment your world into a grid of "pixels", with
the intensity of each pixel being the number of problems
reported there.
 Blur the "image" a bit, to group complaints that are
relatively close, but not in adjacent pixels. How much
you blur is probably a matter of experimentation.
 Run an edge detection algorithm over the
image. You now have borders defining clumps of
complaints.
The major problem that I can see is that you might get
several closelyspaced groups blurring into each other,
creating a single superlarge group.
Update: "Segment the world, find local maxima,
then calculate the voronoi diagram of the point
set described by the maxima" might work, too.

The hell with paco, vote for Erudil!
:wq
 [reply] 

Too much time with the GIMP, eh? :)
But how do you decide how to "segment your world"? That could be a problem...
 [reply] 

