PerlMonks  

Sorting by geographical proximity / clumping groups of items based on X and Y

by vroom (Pope)
on Jul 17, 2002 at 22:59 UTC ( #182629=perlquestion )
vroom has asked for the wisdom of the Perl Monks concerning the following question:

I have an application which tracks complaints based on location. The X and Y for the problem location is stored. However, many complaints could be filed for one problem. Before a work order is made, it would be nice to find all items that fall within a similar geographical area and then associate all of those complaints with the newly created work order. The goal would be to find clumps of problems and list them together. It's straightforward to find all the items that fall within X feet of a given location; doing this for all the possible combinations within the dataset is less desirable.
The goal would be to find a decent solution that isn't too computationally intensive to identify and list these clumps of complaints. Some thoughts I had:
  • Sort the list by X and sort the list by Y, then come up with a method for merging the lists w/o listing a given complaint more than once. Downside: it's hard to identify a merge method that doesn't result in too many misses.
  • Round the X and Y to the nearest N feet, then sort by X concatenated with Y. This finds all items which fall within a given square on the grid. Downside: this fails to notice items which are just over the border from each other.
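The border problem in the second idea can be patched by also checking the eight neighbouring grid cells when looking around a complaint. A rough sketch; the cell size `$N` and the sample data are invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $N = 10;    # grid cell size in feet -- an assumed value

# Invented sample data.
my @complaints = (
    { id => 'a123', x => 45.2, y => 39.7 },
    { id => 'b456', x => 46.5, y => 41.2 },
    { id => 'c789', x => 11.9, y => 29.8 },
);

# Bucket each complaint by its rounded-down grid cell.
my %cell;
for my $c (@complaints) {
    push @{ $cell{ int( $c->{x} / $N ) . ',' . int( $c->{y} / $N ) } }, $c;
}

# A complaint's candidates: its own cell plus the eight surrounding
# cells, which catches items just over a cell border.
sub candidates {
    my ($c) = @_;
    my ( $cx, $cy ) = ( int( $c->{x} / $N ), int( $c->{y} / $N ) );
    my @near;
    for my $dx ( -1 .. 1 ) {
        for my $dy ( -1 .. 1 ) {
            push @near, grep { $_ != $c }
                @{ $cell{ ( $cx + $dx ) . ',' . ( $cy + $dy ) } || [] };
        }
    }
    return @near;
}

print scalar( candidates( $complaints[0] ) ), " candidates near a123\n";
```

The exact distance test still has to be applied to the candidates, but only to the handful in the nine cells rather than to the whole dataset.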
Has anyone had any experience with a similar problem or have an approach that they think would be useful?

Thanks,
Tim

Re: Sorting by geographical proximity / clumping groups of items based on X and Y
by DamnDirtyApe (Curate) on Jul 17, 2002 at 23:22 UTC

    Here's a quick example that will hopefully get you started.

    #! /usr/bin/perl

    use strict;
    use warnings;

    my @complaints = (
        { 'complaint_id' => 'a123', 'x' => '45.2', 'y' => '39.7' },
        { 'complaint_id' => 'b456', 'x' => '79.3', 'y' => '42.0' },
        { 'complaint_id' => 'c789', 'x' => '11.9', 'y' => '29.8' },
        { 'complaint_id' => 'd863', 'x' => '95.3', 'y' => '17.2' },
        { 'complaint_id' => 'e635', 'x' => '65.5', 'y' => '33.3' },
    );

    my ( $curr_x, $curr_y ) = ( 36.2, 47.3 );

    my @sorted_by_proximity =
        map  { $_->[1]->{'dist'} = $_->[0]; $_->[1] }
        sort { $a->[0] <=> $b->[0] }
        map  { [ distance( $curr_x, $curr_y, $_->{'x'}, $_->{'y'} ), $_ ] }
        @complaints;

    foreach ( @sorted_by_proximity ) {
        print $_->{'complaint_id'} . " is " . $_->{'dist'} . " away.\n";
    }

    sub distance {
        my ( $x1, $y1, $x2, $y2 ) = @_;
        return sqrt( abs( $x2 - $x1 )**2 + abs( $y2 - $y1 )**2 );
    }

    This gives you an array of complaints, sorted closest-first. That way, you could take the first n complaints, or all the complaints until the distance is greater than x, etc.

    Update: Thanks to dws and hossman; both are correct. Here's my revised code, which may be a little more useful.

    #! /usr/bin/perl

    use strict;
    use warnings;

    # Define some rules and constants.
    my $CLUSTER_RADIUS = 10;  # Max distance to be included in cluster.
    my $CENTER_SKIP    = 5;   # Distance between centers on cluster scans.
    my $CLUSTER_COUNT  = 4;   # Minimum no. of complaints to make a cluster.
    my $START_X = 0;
    my $END_X   = 100;
    my $START_Y = 0;
    my $END_Y   = 100;

    # Load the complaints.
    my @complaints = ();
    while ( <DATA> ) {
        next if /^$/;
        my $hashref = {};
        ( $hashref->{'id'}, $hashref->{'x'}, $hashref->{'y'} ) = split;
        push @complaints, $hashref;
    }

    for ( my $curr_x = $START_X; $curr_x <= $END_X; $curr_x += $CENTER_SKIP ) {
        for ( my $curr_y = $START_Y; $curr_y <= $END_Y; $curr_y += $CENTER_SKIP ) {
            my @near =
                map  { $_->[1]->{'dist'} = $_->[0]; $_->[1] }
                grep { $_->[0] <= $CLUSTER_RADIUS }
                map  { [ distance( $curr_x, $curr_y, $_->{'x'}, $_->{'y'} ), $_ ] }
                @complaints;
            my $complaint_count = @near;
            if ( $complaint_count >= $CLUSTER_COUNT ) {
                printf "%d complaints near (%d,%d).\n",
                       $complaint_count, $curr_x, $curr_y;
            }
        }
    }

    exit 0;

    #-------------------------------------------------------
    sub distance {
        my ( $x1, $y1, $x2, $y2 ) = @_;
        return sqrt( ( $x2 - $x1 )**2 + ( $y2 - $y1 )**2 );
    }

    __DATA__
    a123 10.1 12.5
    b124 12.5  8.6
    c125  9.8 10.2
    d213 24.6 19.5
    e753 35.6  2.2
    f854 76.5 46.7
    g354  8.7  8.6

    _______________
    D a m n D i r t y A p e
        return sqrt( abs( $x2 - $x1 )**2 + abs( $y2 - $y1 )**2 ) ;
      The abs() isn't necessary, since the result of squaring the difference will always be positive.

      For purposes of ordering pairs by distance, it is sufficient to sum the squared differences and sort on distance^2, deferring the sqrt() until a true distance is necessary. (I picked up this trick from Jon Bentley's "Writing Efficient Programs," which is out of print, but worth grabbing if you find a copy in a used bookstore.)

      You can also exclude from consideration, before sorting, any pair whose difference in either dimension is > X. Such pairs will be excluded anyway, so why clog up the sort with them?
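Both tricks together might look like the sketch below; the cutoff `$X`, the query point, and the sample data are invented:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $X = 10;                        # cutoff distance in feet (assumed)
my ( $cx, $cy ) = ( 45.0, 40.0 ); # query point (assumed)

# Invented sample data.
my @complaints = (
    { id => 'a123', x => 45.2, y => 39.7 },
    { id => 'b456', x => 46.5, y => 41.2 },
    { id => 'd863', x => 95.3, y => 17.2 },
);

# Cull on the cheap per-axis test first, then sort on squared distance;
# sqrt() is only paid when a true distance has to be reported.
my @nearby =
    sort { $a->{d2} <=> $b->{d2} }
    map  { { %$_, d2 => ( $_->{x} - $cx )**2 + ( $_->{y} - $cy )**2 } }
    grep { abs( $_->{x} - $cx ) <= $X && abs( $_->{y} - $cy ) <= $X }
    @complaints;

printf "%s at distance %.2f\n", $_->{id}, sqrt( $_->{d2} ) for @nearby;
```

Sorting on d2 gives the same ordering as sorting on distance, since sqrt() is monotonic.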

      That assumes a $curr_x and a $curr_y ... my understanding of the question was to find clusters of complaints, not just the problems near a given location.
Re: Sorting by geographical proximity / clumping groups of items based on X and Y
by jepri (Parson) on Jul 17, 2002 at 23:35 UTC
    You could try looking at the density of complaints in an area, even though that could be computationally intensive. It would give you some idea of where the 'clumps' are.

    To find the rough edges of the clumps, find the number of complaints in a small radius, then increase the radius and see how many more you get. At some point the program will have to make the call as to whether or not it is worth expanding the circle to include the extra few points, which is a classical calculus problem.

    You would also have to include extra rules to stop the circle from increasing forever or shrinking to nothing.

    Other techniques might involve randomly (or less than randomly) drawing circles around the centre of the clump and seeing which ones catch the most complaints, then declaring the union of all the circles to be the clump.

    ____________________
    Jeremy
    I didn't believe in evil until I dated it.

      This solution is similar to one that I was thinking of. I don't have code but the algorithm is like this:

      • Pick a point where there's a complaint.
      • Draw a circle of some arbitrary radius around it (a little experimentation will find the optimum radius, I'm sure).
      • If another point falls within the radius, increase the area of the circle by a fixed amount and recenter it so it lies at the midpoint between the two points.
      • If another point falls inside the circle, increase the area of the circle again and recenter it so the center lies as close to the center of the cluster as possible.
      • Keep increasing the area and recentering as long as more points fall inside the circle.

      Since you're increasing area instead of radius, the circle will stretch less and less each time until it stops growing. If the circle stops growing, record the center of the cluster and all the complaints inside it. Once you have a cluster, move on to a point that isn't part of a cluster yet and go again. If some point lies in two different clusters, make it part of the closer cluster.
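The grow-and-recenter loop described above can be sketched as follows; the starting radius, the fixed area increment, and the sample points are all invented, and the loop stops as soon as a growth step captures no new points:

```perl
#!/usr/bin/perl
use strict;
use warnings;

use constant PI => 3.14159265358979;

# Invented sample data: three points clumped together, one far away.
my @points = ( [1,1], [2,2], [3,1], [20,20] );

my ( $cx, $cy ) = @{ $points[0] };    # start at some complaint
my $area = PI * 2**2;                 # initial circle of radius 2 (assumed)

my @inside;
while (1) {
    my $r2 = $area / PI;              # radius squared for the current area
    my @now = grep {
        ( $_->[0] - $cx )**2 + ( $_->[1] - $cy )**2 <= $r2
    } @points;
    last if @now == @inside;          # stopped growing: cluster found
    @inside = @now;
    # Recenter on the centroid of the captured points, then grow the area.
    $cx = 0; $cx += $_->[0] / @inside for @inside;
    $cy = 0; $cy += $_->[1] / @inside for @inside;
    $area += PI * 2**2;               # fixed area increment (assumed)
}
print scalar(@inside), " complaints in the cluster at ($cx, $cy)\n";
```

With these numbers the circle swallows the three nearby points, recenters on their centroid, and stops before reaching the outlier.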

Re: Sorting by geographical proximity / clumping groups of items based on X and Y
by dws (Chancellor) on Jul 17, 2002 at 23:51 UTC
    One simple approach to geographic searching is to precalculate a "bucket" for each coordinate, and then organize the data such that the bucket number can be used as a key to quickly look up all coordinates that fall within that bucket. This quickly culls down the search space, allowing you to make the pairwise distance calculation on a subset of the data.

    The initial filtering to find all points within X feet of a given point starts by considering how many additional buckets along one axis (say, the X axis) need to be considered so that the distance from the starting bucket to the far side of the end bucket is >= X feet. This number forms an integral radius for a sweep around the bucket space (or, more simply, determines the square grid of buckets to be searched).

    If your bucket sizes are selected such that most searches can be done by considering a bucket and its immediate neighbors, the SQL query is pretty simple. Prepare the query

        select * from complaint where bucket in (?,?,?,?,?,?,?,?,?)
    and then plug in the bucket ids when you execute. If you need to consider more buckets, write the bucket numbers into a temporary table, using that table to JOIN against the complaint table.

    Then perform the distance calculation on whatever comes back, and sort accordingly.
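The bucket-id arithmetic might look like this; the bucket edge length, the column count, and the coordinates are invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $BUCKET  = 100;    # bucket edge length in feet (assumed)
my $COLUMNS = 1000;   # buckets per row of the coverage area (assumed)

# One integer bucket id per coordinate, precomputed at insert time
# and stored alongside the complaint.
sub bucket_id {
    my ( $x, $y ) = @_;
    return int( $y / $BUCKET ) * $COLUMNS + int( $x / $BUCKET );
}

# The 3x3 block of buckets around a query point; these are the nine
# values to plug into the placeholders of the prepared query above.
sub buckets_to_search {
    my ( $x, $y ) = @_;
    my ( $bx, $by ) = ( int( $x / $BUCKET ), int( $y / $BUCKET ) );
    my @ids;
    for my $dy ( -1 .. 1 ) {
        for my $dx ( -1 .. 1 ) {
            push @ids, ( $by + $dy ) * $COLUMNS + ( $bx + $dx );
        }
    }
    return @ids;
}

print join( ',', buckets_to_search( 450, 250 ) ), "\n";
```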

Re: Sorting by geographical proximity / clumping groups of items based on X and Y
by hossman (Prior) on Jul 18, 2002 at 00:55 UTC
    Questions like this tend to require some definition of "proximity", which is rarely as simple as "N points within a polygon or area M" ... typically people want to find trends -- 3 problems within 1 grid square is a trend, 1 is not -- BUT! 1 per grid square for 10x10 grid squares is.

    An algorithm that seems like it would work well assumes that you have a range of "region sizes" that can overlap, and different thresholds for the number of problems that constitute a "hot spot" region. In pseudocode...

    %grid_size_thresholds = (
        1 => 3,   # a 1x1 square grid is hot if it has 3 probs
        2 => 10,  # a 2x2 square grid is hot if it has 10 probs
        3 => 30,  # ...
        ...
    );
    @hotspots = ();
    foreach $size (sort keys %grid_size_thresholds) {
        for ($x = 0; $x + $size < $GRID_WIDTH; $x++) {
            for ($y = 0; $y + $size < $GRID_HEIGHT; $y++) {
                $region = new Region($x, $x + $size, $y, $y + $size);
                $prob_count = get_num_probs_in_region($region);
                if ($grid_size_thresholds{$size} <= $prob_count) {
                    push @hotspots, $region;
                }
            }
        }
    }
    You now have a list of (possibly nested) square regions of various sizes which identify hotspots. You can eliminate regions which are completely contained by other regions in the list, or you could use the info to find "hot spots within warm spots" (i.e., "region 1,4;2,3 has had enough problems to deserve attention, and within that, 1,2;2,3 deserves special attention").

Re: Sorting by geographical proximity / clumping groups of items based on X and Y
by Abigail-II (Bishop) on Jul 18, 2002 at 11:58 UTC
    I'd say that the problem is a bit underspecified. Two trivial solutions spring to mind, both more or less fitting the specification.
    • Put all N complaints in one clump, and make one workorder.
    • Make N clumps, putting one complaint in each clump, then make a workorder for each clump.
    Obviously, neither extreme is what you want, but what do you want? Are there any limits on the number of clumps, the number of complaints per clump, or the maximum distance between any pair of complaints per clump? Do you have an expression in terms of complaints, clumps, and distances that you want to minimize?

    Even if you say something like wanting sqrt(N) clumps, there will be solutions that you probably do not wish for: you could make a scanline parallel to the Y-axis and make a sweep. Each time your scanline hits a complaint, you increment a counter. When the counter hits k * sqrt(N), the last sqrt(N) encountered complaints are put together into a clump.

    The hardest part is to come up with the specification: what do you want to minimize/maximize/minmax/maxmin? That will determine whether this problem is trivial to solve, or very hard. I've already shown trivial solutions, but it isn't hard to come up with a specification that will quickly reduce problems like the travelling salesman or 0-1 knapsack to an instance of the described problem.

    Abigail

      I would say the most likely definition would be that all items within a clump are at most X feet from any other item in the clump. X would probably have a default of 300 feet or so but could be user-definable.
        But you need more than that, otherwise "put each complaint in its own clump" is still a solution. It's tempting to add the restriction that you want to minimize the number of clumps, but that smells very much like an NP-complete covering problem.

        Abigail

        Despite what I just said about NP-completeness, the following algorithm might give a reasonable solution (certainly not optimal).
        1. Let C be the set of all complaints.
        2. For each complaint c in C find the set S(c) of all complaints d in C such that the distance between c and d is less than X (X is user defined).
        3. Find complaint e in C such that |S(e)| >= |S(f)| for all f in C; that is, find the complaint that has the most other complaints nearby. Pick a random one in case of a tie.
        4. Make a clump out of e and S(e).
        5. Remove all complaints g in S(e) from C. For all h remaining in C, remove from S(h) all g in S(e).
        6. If C is empty, we're done. Else, goto 3.
        Some pseudo code:
        # Get set of all complaints.
        my @C = get_all_complaints;

        # Find all the associated sets.
        my %D = map { my $c = $_;
                      $c => { map  { $_ => 1 }
                              grep { $c ne $_ && distance( $c, $_ ) < $X }
                              @C } } @C;

        while ( %D ) {
            # Find complaint with the most nearby.
            my ( $complaint, $size ) = ( undef, 0 );
            while ( my ( $c, $set ) = each %D ) {
                ( $complaint = $c, $size = keys %$set ) if keys %$set > $size;
            }

            # Found largest, make a clump.
            make_clump( $complaint, keys %{ $D{$complaint} } );

            # Delete largest from set.
            my $set = delete $D{$complaint};

            # Delete associated set from set.
            delete @D{ keys %$set };

            # Delete the clumped complaints (and the centre complaint
            # itself) from the remaining associated sets.
            delete @{$_}{ $complaint, keys %$set } for values %D;
        }
        The performance will be quadratic, I'm afraid.

        Abigail

Use Inverse Square Law
by Notromda (Pilgrim) on Jul 18, 2002 at 15:28 UTC
    Here's an off the wall idea, though I'm a long ways from trying to code it. Anyway, I'll start babbling and let someone else refine this, if it is worth it.

    Think of it visually, like a dark room where each complaint is a candle. The radiance from each candle dwindles as per the inverse square law. If two candles are near each other, there will be a bright area between them. All you need to do is find the brightest place and go there.

    How does this work in code? I'm not sure, but a matrix might be useful. For every complaint, increment the value of nearby geographical points inversely proportional to the distance from the complaint.

    When all complaints are plotted, look for the highest value in the matrix, send a tech there.

    Remove up to X complaints within a radius of Y, and recompute. repeat as necessary.

    The granularity of the matrix could be adjusted as needed. The "luminosity" of complaints will also need tweaking.

    Update:
    Here's some code, but it doesn't quite do what I want. Points on the grid next to a complaint are too "hot". Maybe it should be a more linear dropoff?

    my @complaints = (
        { 'complaint_id' => 'a123', 'x' => '45.2', 'y' => '39.7' },
        { 'complaint_id' => 'b456', 'x' => '46.5', 'y' => '41.2' },
        { 'complaint_id' => 'c789', 'x' => '11.9', 'y' => '29.8' },
        { 'complaint_id' => 'd863', 'x' => '95.3', 'y' => '17.2' },
        { 'complaint_id' => 'e635', 'x' => '65.5', 'y' => '33.3' },
    );

    my @highlight = ( 0.0, 0.0 );
    my $highvalue = 0.0;

    foreach $testpoint (@complaints) {
        for ( $i = 0.0; $i < 100.0; $i += 1.0 ) {
            for ( $j = 0.0; $j < 100.0; $j += 1.0 ) {
                $dsqrd = ( $testpoint->{'x'} - $i )**2
                       + ( $testpoint->{'y'} - $j )**2;
                $points[$i][$j] += $dsqrd ? ( 1000 / ( $dsqrd * 4 * 3.14 ) ) : 0;
                if ( $points[$i][$j] > $highvalue ) {
                    @highlight = ( $i, $j );
                    $highvalue = $points[$i][$j];
                }
            }
        }
    }

    print "\n\n $highlight[0], $highlight[1] has value $highvalue\n";
      I've played with it a little more - I made the luminosity taper off as 1/d instead of 1/(4*pi*d**2). Also, I added a loop to remove visited nodes.

      Even when I re-initialize the points array to 0, it still says to run off and fix one job before going to a job with three complaints. I think this has to do with a complaint coordinate that is really close to a grid point as opposed to others that are not so close to a line. I'm testing it with a more granular grid. First thing I realized is that a 1000 by 1000 grid takes a lot more time to compute. :P

      After trying that experiment, I learned that it plucked off points one at a time... not what I was expecting.

      I'm almost sure there's a better equation for calculating this... :)

      Here's the code that provides an interesting result for the given test set:

      my $service_radius = 15;
      my $numpoints = 8;

      my @complaints = (
          { 'complaint_id' => 'a123', 'x' => '45.2', 'y' => '39.7' },
          { 'complaint_id' => 'b456', 'x' => '46.5', 'y' => '41.2' },
          { 'complaint_id' => 'c789', 'x' => '47.5', 'y' => '38.8' },
          { 'complaint_id' => 'd863', 'x' => '95.3', 'y' => '17.2' },
          { 'complaint_id' => 'e635', 'x' => '10.5', 'y' => '89.3' },
          { 'complaint_id' => 'f635', 'x' => '12.5', 'y' => '1.3' },
          { 'complaint_id' => 'g635', 'x' => '14.5', 'y' => '15.3' },
          { 'complaint_id' => 'h635', 'x' => '4.5', 'y' => '20.3' },
      );

      while ( $numpoints > 0 ) {
          my @highlight = ( 0.0, 0.0 );
          my $highvalue = 0.0;
          my @points;
          for ( $i = 0.0; $i < 100.0; $i += 1.0 ) {
              for ( $j = 0.0; $j < 100.0; $j += 1.0 ) {
                  $points[$i][$j] = 0;
              }
          }
          foreach $testpoint (@complaints) {
              next if not defined $testpoint;
              print "Plotting " . $testpoint->{'complaint_id'} . "\n";
              for ( $i = 0.0; $i < 100.0; $i += 1.0 ) {
                  for ( $j = 0.0; $j < 100.0; $j += 1.0 ) {
                      $dsqrd = ( $testpoint->{'x'} - $i )**2
                             + ( $testpoint->{'y'} - $j )**2;
                      $points[$i][$j] += $dsqrd ? ( 1 / sqrt($dsqrd) ) : 0;
                      if ( $points[$i][$j] > $highvalue ) {
                          @highlight = ( $i, $j );
                          $highvalue = $points[$i][$j];
                      }
                  }
              }
          }
          print "$highlight[0], $highlight[1] has value $highvalue\n";
          foreach $testpoint (@complaints) {
              next if not defined $testpoint;
              if ( sqrt( ( $testpoint->{'x'} - $highlight[0] )**2
                       + ( $testpoint->{'y'} - $highlight[1] )**2 ) < $service_radius ) {
                  print "removing " . $testpoint->{'complaint_id'}
                      . " coordinates $testpoint->{'x'} , $testpoint->{'y'}\n";
                  $testpoint = undef;
                  $numpoints--;
              }
          }
          print "\n\n";
      }    # end while
Re: Sorting by geographical proximity / clumping groups of items based on X and Y
by FoxtrotUniform (Prior) on Jul 18, 2002 at 16:16 UTC

    One off-the-cuff idea: treat this as an image-processing problem.

    1. Segment your world into a grid of "pixels", with the intensity of each pixel being the number of problems reported there.
    2. Blur the "image" a bit, to group complaints that are relatively close, but not in adjacent pixels. How much you blur is probably a matter of experimentation.
    3. Run an edge detection algorithm over the image. You now have borders defining clumps of complaints.
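Steps 1 and 2 above can be sketched with a plain 2-D array and a single 3x3 box-blur pass; the grid size, the cell coordinates, and the blur kernel are all invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $W = 10;    # grid width in cells (assumed)
my $H = 10;    # grid height in cells (assumed)

# Step 1: rasterize complaints into a grid of "pixels"; each entry is
# the number of complaints in that cell. Sample [x, y] cells invented.
my @grid = map { [ (0) x $W ] } 1 .. $H;
my @complaints = ( [ 2, 3 ], [ 3, 3 ], [ 3, 4 ], [ 8, 8 ] );
$grid[ $_->[1] ][ $_->[0] ]++ for @complaints;

# Step 2: one pass of a 3x3 box blur spreads each complaint into the
# neighbouring cells, merging nearby-but-not-adjacent groups.
my @blurred = map { [ (0) x $W ] } 1 .. $H;
for my $y ( 0 .. $H - 1 ) {
    for my $x ( 0 .. $W - 1 ) {
        my ( $sum, $n ) = ( 0, 0 );
        for my $dy ( -1 .. 1 ) {
            for my $dx ( -1 .. 1 ) {
                next if $y + $dy < 0 || $y + $dy >= $H
                     || $x + $dx < 0 || $x + $dx >= $W;
                $sum += $grid[ $y + $dy ][ $x + $dx ];
                $n++;
            }
        }
        $blurred[$y][$x] = $sum / $n;
    }
}

# Step 3 (edge detection) would then trace the borders of the
# non-zero regions in @blurred.
printf "blurred(3,3) = %.3f\n", $blurred[3][3];
```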

    The major problem that I can see is that you might get several closely-spaced groups blurring into each other, creating a single super-large group.

    Update: "Segment the world, find local maxima, then calculate the Voronoi diagram of the point set described by the maxima" might work, too.

    --
    The hell with paco, vote for Erudil!
    :wq

      Too much time with the GIMP, eh? :-)

      But how do you decide how to "segment your world"? That could be a problem...

Re: Sorting by geographical proximity / clumping groups of items based on X and Y
by bedk (Pilgrim) on Jul 18, 2002 at 17:40 UTC
    You're describing a clustering problem. If you know how many clumps you want to end up with, you can run a k-means clustering algorithm, which is just a mathematical formulation of Notromda's suggestion. Because you have a desired maximum distance, you can choose the number of clusters (K) by starting with K=1 and running k-means clustering. If the distance between elements in any cluster is too big, increment K and recluster. Iterate until all clusters meet the maximum distance criterion.
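The inner k-means loop (Lloyd's algorithm) is short enough to sketch directly; the sample points, K=2, the seeding, and the fixed iteration cap are all invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented sample data: two well-separated clumps of three points.
my @points = ( [1,1], [2,1], [1,2], [10,10], [11,10], [10,11] );
my $K = 2;

sub sum { my $s = 0; $s += $_ for @_; return $s }

my @centers = map { [@$_] } @points[ 0 .. $K - 1 ];   # naive seeding
my @assign;

for ( 1 .. 20 ) {    # fixed iteration cap stands in for a convergence test
    # Assignment step: give each point to its nearest center.
    for my $i ( 0 .. $#points ) {
        my ( $best, $bestd ) = ( 0, 9e99 );
        for my $k ( 0 .. $K - 1 ) {
            my $d = ( $points[$i][0] - $centers[$k][0] )**2
                  + ( $points[$i][1] - $centers[$k][1] )**2;
            ( $best, $bestd ) = ( $k, $d ) if $d < $bestd;
        }
        $assign[$i] = $best;
    }
    # Update step: move each center to the mean of its points.
    for my $k ( 0 .. $K - 1 ) {
        my @mine = grep { $assign[$_] == $k } 0 .. $#points;
        next unless @mine;
        $centers[$k] = [ sum( map { $points[$_][0] } @mine ) / @mine,
                         sum( map { $points[$_][1] } @mine ) / @mine ];
    }
}

print "point 0 is in cluster $assign[0], point 3 in cluster $assign[3]\n";
```

The outer "increment K and recluster" loop from the post would wrap this, checking the maximum within-cluster distance after each run.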

    Brian
Re: Sorting by geographical proximity / clumping groups of items based on X and Y
by talexb (Canon) on Jul 18, 2002 at 18:16 UTC
    Again, a fascinating thread. And amazingly I actually studied this kind of problem in university 20 years ago (shudder) under a brilliant professor called Ed Jernigan (University of Waterloo, faculty of Engineering, department of Systems Design).

    So for scholarly research I can suggest Google.

    Abigail-II's point is quite valid -- there need to be constraints on the solution, otherwise optimization cannot be done. You need to do something like:

    • set a maximum cluster radius and
    • set a minimum and/or a maximum number of complaints within each cluster.

    --t. alex

    "Mud, mud, glorious mud. Nothing quite like it for cooling the blood!"
    --Michael Flanders and Donald Swann
