punkish has asked for the
wisdom of the Perl Monks concerning the following question:
Monks,
There are many threads on point-in-polygon analyses, and much published elsewhere. I know how to do it, but my problem is one of speed. I am using the algorithm in the Wolf Book, and it works well. So, given an array of points (x,y pairs) and an array of polys (an array of arrays of points), I do the following
POINT: foreach point {
    POLY: foreach poly {
        if point is in poly {
            do something
            last POLY
        }
    }
}
I have more than 5 million points and more than 200k polys, leading to a possible 1 trillion point-in-poly tests... obviously fewer than that because of last POLY, but nevertheless a pretty tedious task.
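For reference, the test in question is essentially the classic ray-casting (odd-even crossing) algorithm; a minimal sketch, with illustrative names, not the Wolf Book's exact code:

```perl
use strict;
use warnings;

# Ray-casting (odd-even) point-in-polygon test: cast a horizontal ray
# rightwards from the point and count edge crossings; odd means inside.
# $poly is an array ref of [x, y] vertex pairs; names are illustrative.
sub point_in_poly {
    my ( $x, $y, $poly ) = @_;
    my $inside = 0;
    my $n      = @$poly;
    for my $i ( 0 .. $n - 1 ) {
        my ( $x1, $y1 ) = @{ $poly->[$i] };
        my ( $x2, $y2 ) = @{ $poly->[ ( $i + 1 ) % $n ] };
        # toggle when the edge straddles the ray's y and the crossing
        # point lies to the right of ($x, $y)
        if ( ( ( $y1 > $y ) xor ( $y2 > $y ) )
            and $x < $x1 + ( $y - $y1 ) * ( $x2 - $x1 ) / ( $y2 - $y1 ) )
        {
            $inside = !$inside;
        }
    }
    return $inside ? 1 : 0;
}

my $square = [ [ 0, 0 ], [ 1, 0 ], [ 1, 1 ], [ 0, 1 ] ];
print point_in_poly( 0.5, 0.5, $square ), "\n";    # 1 (inside)
print point_in_poly( 2,   2,   $square ), "\n";    # 0 (outside)
```

Each point costs O(V) in the number of vertices, which is why cutting down the number of polygons tested per point matters so much here.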
I seek advice on how to speed this up. Several thoughts come to mind, some difficult, some perhaps not so:
- Cache the polys... build a Berkeley DB while checking the first point, and from then on test against the db instead of the polys file.
- Somehow compute where the point is likely to lie, so as to avoid unnecessary comparisons... for example, what's the point of comparing cities in Maine with counties in California? To do so, I would have to build some kind of quadtree index.
- any other...
WWMD?

when small people start casting long shadows, it is time to go to bed
Re: Speeding up point-in-polygon
by bobf (Monsignor) on Jul 23, 2006 at 17:49 UTC

I'd consider adding an initial screen that uses a much faster algorithm. If a point passes the screen, then the point_in_polygon routine could be called (for those that don't already own it, the Wolf Book is Mastering Algorithms with Perl). Both of the approaches I describe are similar to your Maine/California example, and both would work well with a database to store the results and avoid repeat calculations.
The first approach would be to partition your space into grids (the size of which should be determined based on how tightly your polygons and points are clustered). Determine which partition the polygons and points are in, then for each partition only test that set of polygons and points.
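The grid-partition screen might be sketched like this (cell size, data layout, and helper names are all illustrative; a polygon is pushed into every cell its bounding box overlaps):

```perl
use strict;
use warnings;
use POSIX qw(floor);

my $CELL = 100;    # grid cell size; tune to how your data clusters

# Bucket each polygon into every grid cell its bounding box overlaps.
# %grid maps "cx,cy" cell keys to array refs of polygons.
sub build_grid {
    my (@polys) = @_;    # each poly: { bbox => [xmin, ymin, xmax, ymax], ... }
    my %grid;
    for my $poly (@polys) {
        my ( $xmin, $ymin, $xmax, $ymax ) = @{ $poly->{bbox} };
        for my $cx ( floor( $xmin / $CELL ) .. floor( $xmax / $CELL ) ) {
            for my $cy ( floor( $ymin / $CELL ) .. floor( $ymax / $CELL ) ) {
                push @{ $grid{"$cx,$cy"} }, $poly;
            }
        }
    }
    return \%grid;
}

# Candidate polys for a point: only those bucketed in the point's cell.
sub candidates {
    my ( $grid, $x, $y ) = @_;
    my $key = floor( $x / $CELL ) . ',' . floor( $y / $CELL );
    return @{ $grid->{$key} || [] };
}

my $grid = build_grid(
    { name => 'A', bbox => [ 10,  10,  90,  90 ] },
    { name => 'B', bbox => [ 150, 150, 380, 260 ] },
);
print join( ',', map { $_->{name} } candidates( $grid, 50, 50 ) ), "\n";    # A
```

Only the candidates returned for a point's cell need the expensive point_in_polygon call.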
The second approach is a little more complex, but it might be worthwhile if your polygons and points are tightly clustered and the grid partition method only minimally reduces the calculations. Circumscribe a circle around each polygon^{1} and store the center point and radius of the circle. The initial screen then becomes a very simple (and fast) comparison of the distance between the center of the circle and the point in question vs. the radius of the circle. This method won't work as well if the polygons are very irregularly shaped, as you'll get a lot of points that fall into the crevices^{2}.
Alternatively, you could look into coding the algorithm in C (using perlxs) or PDL.
HTH
^{1}Update 1: Not all polygons can be circumscribed according to the true definition of "circumscribe", which states that every vertex of the polygon must be on the circle. For the purposes of this problem, I intended to suggest that any circle large enough to encompass the polygon could be used. This could be determined by trial and error using unit increments to the radius, or it could be determined more precisely by finding the largest distance between two vertices (which becomes the diameter of the circle, so the center is the midpoint of that segment).
Please note that there is room for error in this approach unless every vertex of the polygon is tested to ensure they are all contained within the circle. Nonetheless, it may be good enough for a first pass, and a suitable margin of error can be used (doubling the size of the circle, etc.) to minimize the chance for error.
^{2}Update 2: NiJo had a great suggestion: instead of using a circle as I initially proposed, use a rectangle. The bounds would be much easier to calculate, and it would likely work just as well (if not better, since it would be guaranteed to contain the entire polygon) for a first-pass screen.
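The bounding-rectangle screen could be sketched like this (helper names and data layout are illustrative): compute each polygon's axis-aligned box once, then reject points with four comparisons before ever calling the exact test.

```perl
use strict;
use warnings;

# Compute the axis-aligned bounding box of a polygon once, up front.
# A point outside the box cannot be inside the polygon, so the box
# gives a four-comparison rejection test before the exact algorithm.
sub bbox {
    my ($poly) = @_;    # array ref of [x, y] vertex pairs
    my ( $xmin, $ymin ) = @{ $poly->[0] };
    my ( $xmax, $ymax ) = ( $xmin, $ymin );
    for my $p (@$poly) {
        $xmin = $p->[0] if $p->[0] < $xmin;
        $xmax = $p->[0] if $p->[0] > $xmax;
        $ymin = $p->[1] if $p->[1] < $ymin;
        $ymax = $p->[1] if $p->[1] > $ymax;
    }
    return [ $xmin, $ymin, $xmax, $ymax ];
}

sub in_bbox {
    my ( $x, $y, $b ) = @_;
    return $x >= $b->[0] && $x <= $b->[2]
        && $y >= $b->[1] && $y <= $b->[3];
}

my $b = bbox( [ [ 3, 1 ], [ 7, 2 ], [ 5, 9 ] ] );    # a triangle
print in_bbox( 5,  5, $b ) ? "maybe inside\n"   : "surely outside\n";
print in_bbox( 20, 5, $b ) ? "maybe inside\n"   : "surely outside\n";
```

A "maybe inside" still needs the exact point-in-polygon test; only the "surely outside" results are free.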

I would spend a lot of time preprocessing to cut down on the amount of looping you have to do.
In describing time complexity I'll use M for the number of points you have and N for the number of rectangles.
First, wrap every polygon in a rectangle as described in this tip. This is O(N).
Then, create four lists of rectangles sorted by xMin, xMax, yMin, and yMax. This is O(N log N) if you use quicksort.
Now you cluster the rectangles into super-rectangles. How you do this is up to you, and should probably be informed by your data. If it's evenly distributed, you can do this in essentially O(N) time by simply using a grid system.
If it is not evenly distributed, you do something like the following recursively: for any super-rectangle, pick randomly from the set of xMin, xMax, yMin, and yMax lists. Go to the median of the list you just selected (this should be constant time if your lists are backed by arrays). Draw a line there, separating half of the rectangles on one side and half on the other; i.e., if you picked yMax, you draw a horizontal line there. Then use an O(N) loop to assign rectangles to either or both of the new slightly-less-super rectangles. If a super-rectangle contains more than some arbitrary threshold of your total number of rectangles, you recursively subdivide it. Otherwise, you stop that level of the recursion and add that super-rectangle to your super-rectangle list. (Note: you'll want a sanity check on your recursion depth here, as it is possible, though unlikely in practice, that this will loop forever otherwise.)
So now you've got a list of super-rectangles, each of which has a list of sub-rectangles, each of which is associated with a polygon. Following so far? Now sort the list of super-rectangles by number of rectangles (smallest to largest), and sort each list of sub-rectangles by area (largest to smallest). The goal of both of these sorts is to maximize the probability that you will quickly find one qualifying polygon and be able to kill the search, skipping huge swathes of your M*N search space.
Then, and only then, you get to (forgive the non-Perl pseudocode):
for (foo in points)
    for (bar in superRectangles)
        if (bar contains foo)
            for (foobar in subRectangles of bar)
                if (foobar contains foo)
                    if (polygon of foobar contains foo)
                        do whatever
                        loop to next point
Re: Speeding up point-in-polygon
by explorer (Chaplain) on Jul 23, 2006 at 19:10 UTC

 Install Math::Geometry::Planar
 Edit Planar.pm and add
use Memoize;
memoize('CrossProduct');
 Buy more RAM :)
 Insert: "next polygon if point's coords are out of polygon's box limits" (2 conditionals)
 Use the isinside function of Math::Geometry::Planar.
Update: Added the 'point out of the box' line.
Update: I used this solution last year. Another solution is to use the clipping routines of the GPC library (also on CPAN) or to use Inline::C.
Re: Speeding up point-in-polygon
by NiJo (Friar) on Jul 23, 2006 at 20:09 UTC

"Knowing your data" is the first step when optimizing. I'd plot, look and think about a managable subsample.
Then it could make sense to extend and simplify the screening algorithm idea.
The cartesian data makes it much easier to use squares. I'd reduce each polygon to an 'outer' square (min_x, max_x, min_y, max_y). Databases are good at searching, so the list of potential points is just an SQL query away.
The next step is "punching a hole" into the polygon. This could find points that are surely inside. Iteratively extending an 'inner' box from the center of the outer box would work if the center is inside the polygon.
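This outer/inner box idea amounts to a three-way screen; a sketch with illustrative box values and names, where 'maybe' points fall through to the exact point-in-polygon test:

```perl
use strict;
use warnings;

# $box = [xmin, ymin, xmax, ymax]
sub in_box {
    my ( $x, $y, $box ) = @_;
    return $x >= $box->[0] && $x <= $box->[2]
        && $y >= $box->[1] && $y <= $box->[3];
}

# Outside the outer box: surely out. Inside the inner box: surely in.
# Anything else needs the exact (and expensive) point-in-polygon test.
sub classify {
    my ( $x, $y, $outer, $inner ) = @_;
    return 'out' if !in_box( $x, $y, $outer );
    return 'in'  if in_box( $x, $y, $inner );
    return 'maybe';
}

my $outer = [ 0, 0, 10, 10 ];
my $inner = [ 3, 3, 7,  7 ];
print classify( 5,  5,  $outer, $inner ), "\n";    # in
print classify( 1,  1,  $outer, $inner ), "\n";    # maybe
print classify( 20, 20, $outer, $inner ), "\n";    # out
```

The better the inner box fills the polygon, the fewer 'maybe' points are left for the slow path.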

If you can reduce the data to cartesian coords and the entire zone can reside in RAM, PDL is the fastest solution.
Example: if $zone is a piddle of bytes with a 1 at each point's coords, you can retrieve the coords of the points inside the polygon's box limits with:
$area = $zone->slice("$xmin:$xmax,$ymin:$ymax");
$coord_points = whichND( $area );
Now you will have the coords of all points inside the $area:
@x_coords = list $coord_points->slice("(0),:");
@y_coords = list $coord_points->slice("(1),:");
This is the fastest clipping method if you can reduce the data universe to cartesian coords.
And PDL is a compiled C library.
Re: Speeding up point-in-polygon
by Ieronim (Friar) on Jul 23, 2006 at 18:02 UTC

How often will you need to perform such a task? How big are the polygons? How many polygons can you store in memory at once?
Reading the polygons from a file at every turn of the cycle is surely a bad idea. You need to store as many polys as you can in memory and match the points against them. Then throw away all matched points, as the polygons don't intersect, and match all remaining points against the next portion of polys. This advice holds in any case.
Maybe you can improve your matching function: look at the source code of Math::Polygon::Calc, the polygon_contains_point() sub. Rewriting it as XS will be even faster. The algorithms can be found here. This is a hard way to go; do it only if the memory caching does not give enough performance gain.
These simple optimizations might not be enough if you need to perform this task often, and they will work better in combination with bobf's algorithmic improvements.
s;;JustmenothNimPNilmIarONi;;tr?IerONim?HAcker ?d;print
Re: Speeding up point-in-polygon
by BrowserUk (Pope) on Jul 24, 2006 at 00:41 UTC

Caching the results of your point-in-poly function will not yield any performance gain unless you have duplicate points. And if you have duplicate points, you'd almost certainly be better off sorting and grouping them so that you only look each point up once, as the space requirements for the caching are non-trivial.
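Grouping duplicates might look like this (a sketch, assuming points are [x, y] array refs):

```perl
use strict;
use warnings;

# Group identical points under one key so the expensive point-in-poly
# test runs once per distinct location instead of once per point.
my @points = ( [ 1, 2 ], [ 3, 4 ], [ 1, 2 ], [ 1, 2 ] );

my %group;
push @{ $group{"$_->[0],$_->[1]"} }, $_ for @points;

for my $key ( sort keys %group ) {
    my $count = @{ $group{$key} };
    print "$key tested once, covers $count point(s)\n";
}
```

Whatever action the point-in-poly hit triggers can then be applied to every point in the matching group at once.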
Personally, I'd preprocess the polys to group them by their xmin bound, and subgroup them by their ymin bound. That way, you would only do a full point-in-poly comparison of each point against a small subset of the total polys. E.g.
my %hashedPolys;
for ( 0 .. $#polys ) {
    push @{ $hashedPolys{ xmin( $polys[ -1 ] ) % 100 }{ ymin( $polys[ -1 ] ) % 100 } }, pop @polys;
}
...
for my $point ( @points ) {
    for my $poly ( @{ $hashedPolys{ $point->[0] % 100 }{ $point->[1] % 100 } } ) {
        if ( pointInPoly( $point, $poly ) ) {
            ## take whatever action is appropriate
        }
    }
}
The choice of granularity (x & y) for the hashing (where I've used % 100) will depend very much on the size and shape of your total coordinate space and the maximum size (enclosing rectangle), of your polys.
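One heuristic for that granularity (the % 100 above): make a cell at least as large as the biggest enclosing rectangle, so no polygon spans more than a couple of buckets in either direction. A sketch with made-up bounding boxes:

```perl
use strict;
use warnings;
use List::Util qw(max);

# Pick a grid cell size from the largest bounding-box extent among the
# polys, so every polygon lands in a bounded number of buckets.
my @bboxes = (
    [ 0,  0,  30, 12 ],    # [xmin, ymin, xmax, ymax]
    [ 50, 50, 55, 90 ],
    [ 10, 70, 18, 75 ],
);

my $cell = max map { ( $_->[2] - $_->[0], $_->[3] - $_->[1] ) } @bboxes;
print "cell size: $cell\n";    # 40 (the tallest box, 90 - 50)
```

A handful of oversized polygons can blow this up; it may be worth pulling those out into a separate brute-force list instead.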
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.  Rule 1 has a caveat!  Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Speeding up point-in-polygon
by sgifford (Prior) on Jul 24, 2006 at 00:52 UTC

You could also think about having a database do some of the heavy lifting for you. MySQL spatial extensions are designed to solve this problem, and PostGIS does the same thing for Postgres.

A good point, though sometimes simplicity can be very complicated. :) Please post back if you find any useful GIS stuff for SQLite or Berkeley DB; I've been meaning to look, but haven't yet.
Re: Speeding up point-in-polygon
by bart (Canon) on Jul 24, 2006 at 10:01 UTC

Just some rough ideas... You're using a brute force approach on lots and lots of polygons. Refine it.
I have the impression that usually (I'm thinking of 3D modeling and ray tracing, in the first place) these polygons tend to be grouped in cluster areas of relatively small size. You should therefore group the polygons into polygon groups, create a bounding box or a super-polygon (the polygon that contains all polygons in the group), and check if a point is inside that super-polygon. If not, you can skip the whole group.
As a very rough example, suppose you can split your screen into 4 quadrants, each containing about 1/4 of the polygons. You test for each point if it falls in such a quadrant, and only for that quadrant, test the polygons in the way you did. You'll get an average speedup of close to 4x already.
For maximum effectiveness, you should first create many such groups, each spanning as little area as possible. The areas may overlap, but preferably as little as possible...
So, how can you achieve that?
For a start, I'd take out the really large polygons. You'll likely have to brute force those anyway.
But I suspect most polygons will be really tiny, not much larger than dots. Well, you could try using clustering algorithms on them: on their centers of gravity, for example.
To put those into groups, there's the algorithm that belg4mit asked about a little while ago. You could try that out for size. Oh yeah, I turned that into Perl code. I almost forgot. ;)
Re: Speeding up point-in-polygon
by Dervish (Friar) on Jul 24, 2006 at 07:01 UTC

One final consideration: is Perl the best choice of implementation environment for this task? A repeated tight loop through a mathematically bound calculation just calls out for a compiled program of some kind. I'll admit I'm no expert on Perl performance, but sometimes tasks that are 'fast enough' turn out not to be fast enough when you do them 1 trillion times, and it's worth taking the extra effort to write it in a more specialized language.
Re: Speeding up point-in-polygon
by Moron (Curate) on Jul 24, 2006 at 11:15 UTC

Pursuing your own 'prune the unnecessary' idea led me to this idea:
Preparation: Find the overall envelope and divide it into equal rectangles, creating a grid isomorphic to (0..root(max(x) - min(x))) X (0..root(max(y) - min(y))). This creates discrete grid coordinates which can be computed for each point. Store the polygons first as an array, and then make a hash(x-grid-coord) of hash(y-grid-coord) of point(=small array) of hash of indexes into the polygon array. Wrap all of this into another hash to store both in a Storable. This has the effect of making a new-style hash (associative array) function as an old-style hash (i.e. indirect addressing that buckets neighbours together in the storage space, to drastically reduce lookup time).
Iteration: To find which polygons contain a test point then requires computing the grid reference of the point and looking up in the hash of hashes which polygons include it. After the one-off preparation is complete, although that is large, this reduces subsequent iterations to one per point, with the nested hash lookups substituting for what was previously a cross-product of iterations.
Update: Could also let the points themselves (ungridded) loose on a hash of hashes and test to see which approach is quicker. I have some expectation that the overhead of computing coordinates in the grid (bucketed hashing) should win versus lookup with large hash element counts when using raw hashing, but by no means certainty.
Re: Speeding up point-in-polygon
by swampyankee (Parson) on Jul 24, 2006 at 19:33 UTC

This is a well-studied problem, and your problem set may not even come into the category of "big" by the computational geometry mavens. Have you looked at PNPoly, the Computational Geometry Algorithms Library, or Graphics Gems? Another possibility would be to use a constrained Delaunay triangulation.
Of more relevance to Perl, writing a Perl extension is one way to go. I think that using a specialized computational geometry library is likely to be more satisfying, though. Try CGAL.
emc
Login incorrect.
Only perfect spellers may
enter this system.
Jason Axley
Re: Speeding up point-in-polygon
by zentara (Archbishop) on Jul 24, 2006 at 16:41 UTC

This is just a bit of brainstorming, possibly showing my Tk mindset. I know you posed the question in terms of rigorous mathematical results, but I was thinking of using the Tk Canvas and its 'overlapping' tags method to do something like this. Of course, you do need an X server running, but you also gain the benefit of being able to set up a real-time display of your point queries. And of course, you have the problem of drawing all your states, counties and cities.
#!/usr/bin/perl
use warnings;
use strict;
use Tk;

my $top = new MainWindow;
my $c   = $top->Canvas->pack;

my $circle = $c->createOval( 30, 30, 100, 100,
    fill    => 'blue',
    tags    => ['circle'],
    stipple => 'gray12',
);
my $rect1 = $c->createRectangle( 10, 10, 44, 44,
    fill    => 'green',
    stipple => 'gray12',
    tags    => ['rect1'],
);
my $rect2 = $c->createRectangle( 93, 93, 200, 200,
    fill    => 'yellow',
    tags    => ['rect2'],
    stipple => 'gray12',
);
my $poly1 = $c->createPolygon( 0, 0, 44, 44, 55, 55, 90, 90, 200, 200, 10, 100, 0, 0,
    fill        => 'red',
    smooth      => 1,
    splinesteps => 100,
    stipple     => 'gray12',
    tags        => ['poly1'],
);

$c->Tk::bind( "<Motion>", [ \&print_xy, Ev('x'), Ev('y') ] );
&print_xy( $c, 42, 42 );
MainLoop;

sub print_xy {
    my ( $canv, $x, $y ) = @_;
    print "(x,y) = ", $canv->canvasx($x), ", ", $canv->canvasy($y), "\n";
    # find('overlapping', $x, $y, $x, $y) actually uses a zero-size rectangle
    my @current = $canv->find( 'overlapping', $x, $y, $x, $y );
    foreach my $id (@current) {
        print $canv->gettags($id), ' ';
    }
    print "\n";
}