Beefy Boxes and Bandwidth Generously Provided by pair Networks DiBona
Syntactic Confectionery Delight
 
PerlMonks  

Speeding up point-in-polygon

by punkish (Priest)
on Jul 23, 2006 at 17:07 UTC ( #563130=perlquestion: print w/ replies, xml ) Need Help??
punkish has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

There are many threads on point-in-polygon analyses, and much published elsewhere. I know how to do it, but my problem is of speed. I am using the algorithm in the Wolf Book, and it works well. So, given an array of points (x,y pairs), and an array of polys (an array of array of points), I do the following --

POINT: foreach point { POLY: foreach poly { if point is in poly { do something last POLY } } }

I have more than 5 million points, and more than 200k polys, leading to a possible 1 trillion intersections to be computed... obviously less than that because of last POLY, nevertheless, a pretty tedious task.

I seek advice on how to speed this up. Several thoughts come to mind, some difficult, some perhaps not so --

  • Cache the polys... build a Berkeley DB while checking the first point, and from then on, test against the db instead of the polys file.
  • Somehow compute where the point is likely to lie, so as to not do unnecessary comparisons... for example, what's the point of comparing cities in Maine with counties in California. To do so, I would have to build some kind of quadtree index.
  • any other...
WWMD?
--

when small people start casting long shadows, it is time to go to bed

Comment on Speeding up point-in-polygon
Select or Download Code
Re: Speeding up point-in-polygon
by bobf (Monsignor) on Jul 23, 2006 at 17:49 UTC

    I'd consider adding an initial screen that uses a much faster algorithm. If a point passes the screen, then the point_in_polygon routine could be called (for those that don't already own it, the Wolf Book is Mastering Algorithms with Perl). Both of the approaches I describe are similar to your Maine/California example, and both would work well with a database to store the results and avoid repeat calculations.

    The first approach would be to partition your space into grids (the size of which should be determined based on how tightly your polygons and points are clustered). Determine which partition the polygons and points are in, then for each partition only test that set of polygons and points.

    The second approach is a little more complex, but it might be worthwhile if your polygons and points are tightly clustered and the grid partition method only minimally reduces the calculations. Circumscribe a circle around each polygon1 and store the center point and radius for the circle. The initial screen then becomes a very simple (and fast) comparison of the distance between the center of the circle and the point in question, vs the radius of the circle. This method won't work as well if the polygons are very irregularly-shaped, as you'll get a lot of points that fall into the crevices2.

    Alternatively, you could look into coding the algorithm in C (using perlxs) or PDL.

    HTH

    1Update 1: Not all polygons can be circumscribed in according to the true definition of "circumscribe", which states that every vertex of the polygon must be on the circle. For the purposes of this problem, I intended to suggest that any circle large enough to encompass the polygon could be used. This could be determined by trial and error using unit increments to the radius, or it could be determined more precisely by finding the largest distance between two vertices (which becomes the diameter of the circle, therefore the center is the midpoint of that segment).

    Please note that there is room for error in this approach unless every point of the polygon is tested to ensure they are all contained within the circle. Nonetheless, it may be good enough for a first pass, and a suitable margin of error be used (doubling the size of the circle, etc) to minimize the chance for error.

    2Update 2: NiJo had a great suggestion - instead of using a circle as I initially proposed, use a rectangle. The bounds would be much easier to calculate and it would likely function just as well (if not better, since it would be guaranteed to contain the entire polygon) for a first pass screen.

      I would spend a lot of time pre-processing to cut down on the amount of looping you have to do.

      In describing time complexity I'll use M for the number of points you have and N for the number of rectangles.

      First, wrap every polygon in a rectangle as described in this tip. This is O(n).

      Then, create four lists of rectangles sorted by xMin, xMax, yMin, and yMax. This is O(n log n) if you use quicksort.

      Now, you cluster the rectangles into super-rectangles. How you do this is up to you, and should probably be informed by your data. If its evenly distributed you can do this in essentially O(n) time by simply using a grid system.

      If it is not evenly distributed, you do something like the following recursively: for any super-rectangle, roll randomly from the set of xMin, xMax, yMin, and yMax. Go to the median of the list you just selected (this should be constant time if your lists are backed by arrays). Draw a line there separating half of the points on one side of the line and half on the other. i.e. if you picked yMax you draw a horizontal line there. Then you use an O(n) loop to assign rectangles to either or both of the new slightly-less-super rectangles. If a rectangle contains more than some arbitrary threshhold of your total number of rectangles, you recursively subdivide it. Otherwise, you stop that level of the recursion and add that superrectangle to your superrectangle list. (Note, you'll want a sanity check on your recursion depth here, as it is possible but unlikely in practice that this will loop forever otherwise)

      So now you've got a list of super-rectangles, each of which has a list of sub-rectangles, each of which is associated with a polygon. Following so far? Now sort the list of super-rectangles by number of rectangles (smallest to largest), and sort each list of sub-rectangles by area (largest to smallest). The goal of both of these sorts is to maximize the probability that you will quickly find one qualifying polygon and be able to kill the search, skipping huge swathes of your M*N search space.

      Then, and only then, you get to (forgive non-Perl pseudocode),

      for(foo in points) for(bar in superRectangles) if (bar contains foo) for (foobar in subRectangles of bar) if (foobar contains foo) if (polygon of foobar contains foo) do whatever loop to next point
Re: Speeding up point-in-polygon
by Ieronim (Friar) on Jul 23, 2006 at 18:02 UTC
    How often will you need to perform such task? How big are polygons? How many polygons can you store in memory at once?

    Reading the polygons from file at every turn of the cycle is surely a bad idea. You need to store as many polys as you can in the memory and match the points against them. then throw away all matched points, as the polygons don't intersect, and match all left points against the next portion of polys. This advice is true in any case.

    Maybe you can improve your matching function — look at the source code of Math::Poly::Calc, the polygon_contains_point() sub. Rewriting it as XS will be even faster. The algorithms can be found here. This is a hard way to go—do it only if the memory caching does not give enough performance gain.

    These simple optimizations might not be enough if you need to perform this task often, and they will work better in combination with bobf's algorithmic improvements.


         s;;Just-me-not-h-Ni-m-P-Ni-lm-I-ar-O-Ni;;tr?IerONim-?HAcker ?d;print
Re: Speeding up point-in-polygon
by explorer (Chaplain) on Jul 23, 2006 at 19:10 UTC
    1. Install Math::Geometry::Planar
    2. Edit Planar.pm and add
      use Memoize; memoize('CrossProduct');
    3. Buy more RAM :)
    4. Insert: "next polygon if point's coords are out of polygon's box limits" (2 conditionals)
    5. Use the isinside function of Math::Geometry::Planar.

    Update: Added the 'point out of the box' line.
    Update: I used this solution the last year. Other solution is to use the clipping routines of GPC library (in CPAN also) or use Inline::C.
Re: Speeding up point-in-polygon
by NiJo (Friar) on Jul 23, 2006 at 20:09 UTC
    "Knowing your data" is the first step when optimizing. I'd plot, look and think about a managable subsample.

    Then it could make sense to extend and simplify the screening algorithm idea.

    The cartesian data makes it much easier to use squares. I'd reduce each polygon to a 'outer' square (min_x, max_x, min_y, max_y). Databases are good at searching, so the list of potential points is just a sql query away.

    The next step is "punching a hole" into the polygon. This could find points that are surely inside. Iteratively extending an 'inner' box from the center of the outer box would work if the center is inside the polygon.

      Ok, so this seems like a possible approach, but applied in reverse (update: in other words, instead of foreach point {foreach poly}}, I am going to do foreach poly {get all points {foreach point {more fine-grained}}..
      1. Generate a SQL table of x,y of all the points
      2. foreach rect bound of poly, find all the points that lie within -- this could be done via SQL -- SELECT points WHERE (point.x > poly.xmin) AND (point.x < poly.xmax) AND (point.y > poly.ymin) AND (point.y < poly.ymax)
      3. do more fine-grained point-in-poly analysis on the points SELECTed in #2 above, either by the said "Wolf Book" algorithm, or just use Math::Geometry::Planar
      Sounds like a plan.
      --

      when small people start casting long shadows, it is time to go to bed
        If you can reduce the data to cartesian coords. and the entire zone can reside into RAM, PDL is the faster solution.

        Example: If $zone is a piddle of bytes with an 1 where the point's coords, you can retrieve the coords of points into polygon's box limits with:
        $area = $zone->slice("$xmin:$xmax,$ymin:$ymax"); $coord_points = wichND( $area );
        Now, you will have the coords of all points into the $area:
        @x_coords = list $coord_points->slice("(0),:"); @y_coords = list $coord_points->slice("(1),:");
        This is the fast clipping method if you can reduce the data universe to cartesian coords.

        And PDL is a C compiled library.
Re: Speeding up point-in-polygon
by BrowserUk (Pope) on Jul 24, 2006 at 00:41 UTC

    Caching the results of your point-in-poly function will not yeild any performance gain unless you have duplicate points. And if you have duplicate points you'd almost certainly be better off sorting & grouping them so that you only look each point up once as the space requirements for the caching is non-trivial.

    Personally, I'd pre-process the polys to group them by their xmin bound, and subgroup the by their ymin bound. That way, you would only do a full point-in-poly comparison of each point against a small subset of the total polys. Eg.

    my %hashedPolys; for ( 0 .. $#polys ) { push @{ $hashPolys{ xmin( $polys[ $#polys ] ) % 100 }{ ymin( $poly +s[ $#polys ) % 100 } }, pop @polys; } ... for my $point ( @points ) { for my $poly ( @{ $hashedPolys{ $point[0] % 100 }{ $point[ 1 ] % 1 +00 } } ) { if( pointInPoly( $point, $poly ) ) { ## take whatever action is appropriate } } }

    The choice of granularity (x & y) for the hashing (where I've used % 100) will depend very much on the size and shape of your total coordinate space and the maximum size (enclosing rectangle), of your polys.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Speeding up point-in-polygon
by sgifford (Prior) on Jul 24, 2006 at 00:52 UTC
      PostGIS is my last resort... for simplicity sake, I didn't even want to use any SQL, but might try SQLite. If all else fails, then I will work with PostGIS (always looking for the simplest possible thing that could work well).
      --

      when small people start casting long shadows, it is time to go to bed
        A good point, though sometimes simplicity can be very complicated. :) Please post back if you find any useful GIS stuff for SQLite or Berkeley DB; I've been meaning to look, but haven't yet.
Re: Speeding up point-in-polygon
by Dervish (Friar) on Jul 24, 2006 at 07:01 UTC
    One final consideration: is Perl the best choice for implementation environment for this task? A repeated tight loop through a mathematically bound calculation just calls out for a compiled program of some kind. I'll admit, I'm no expert on Perl performance, but sometimes tasks that are 'fast enough' turn out to not be fast enough when you do them 1 trillion times, and it's worth taking the extra effort to write it in a more specialized language.
Re: Speeding up point-in-polygon
by bart (Canon) on Jul 24, 2006 at 10:01 UTC
    Just some rough ideas... You're using a brute force approach on lots and lots of polygons. Refine it.

    I have the impression that usually (I'm thinking of 3D modeling and raytracing, in the first place), these polygons tend to be grouped in cluster areas of relatively small sizes. You should therefore group the polygons in polygon groups, create an outside box or a superpolygon (the polygon that contains all polygons in the group) and check if a point is inside that superpolygon. If not, you can skip the whole group.

    As a very rough example, suppose you can split your screen into 4 quadrants, each containing about 1/4 of the polygons. You test for each point if it falls in such a quadrant, and only for that quadrant, test the polygons in the way you did. You'll get an average speedup of close to 4x already.

    For maximum effectiveness, you should get a lot of those groups first, that all span as little area as possible. So, the object is to get groups out of all of your polygons first, so this splits up into as small area as possible. The areas may overlap, but preferable as little as possible...

    So, how can you achieve that?

    For a start, I'd take out the really large polygons. You'll likely have to brute force those anyway.

    But I suspect most polygons will be really tiny, almost not much larger than dots. Well, you could try using clustering algorithms for dots — on their centers of gravity, for example.

    To put those into groups, there's the algorithm that belg4mit asked about a little while ago. You could try that out for size. Oh yeah, I turned that into Perl code. I almost forgot. ;-)

Re: Speeding up point-in-polygon
by Moron (Curate) on Jul 24, 2006 at 11:15 UTC
    Pursuing your own 'prune the unnecessary' idea led me to this idea:

    Preparation: Find the overall envelope and divide it into equal rectangles creating a grid isomorphic to (0..root(max(x) - min(x))) X (0..root(max(y) - min(y))). This creates discrete grid co-ordinates which can be computed for each point. Store the polygons first as an array and then make a hash(x-grid-coord) of hash(y-grid-coord) of point(=small array) of hash of indexes into the polygon array. Wrap all of this into another hash to store both in a Storable. This has the effect of making a new-style hash (associative array) function as an old-style hash (i.e. indirect addressing that buckets neighbours together in the storage space, to drastically reduce lookup time).

    Iteration: To find which polygons contain a test point then requires computing the grid reference of the point and looking up in the hash of hash for which polygons include it. After the one-off preparation is complete, although that is large, this reduces subsequent iterations to one per point, with the nested hash lookups substituting for what was previously a cross-product of iterations.

    Update: Could also let the points themselves (ungridded) loose on a hash of hash and test to see which approach is quicker - I have some expectation that the overhead of computing co-ordinates (in the grid) (bucketed hashing) should win versus lookup with large hash element counts when using raw hashing, but by no means certainty.

    -M

    Free your mind

Re: Speeding up point-in-polygon
by zentara (Archbishop) on Jul 24, 2006 at 16:41 UTC
    This is just a bit of brainstorming, possibly showing my Tk mindset. I know you posed the question in terms of rigourous mathematical results, but I was thinking of using the Tk Canvas and it's overlapping tags method, to do somethink like this. Of course, you do need a Xserver going, but you also gain the benefit of being able to setup a realtime display of your point-queries. And of course, you have the problem of drawing all your states, counties and cities.
    #!/usr/bin/perl use warnings; use strict; use Tk; my $top = new MainWindow; my $c=$top->Canvas->pack; my $circle = $c->createOval(30,30,100,100, -fill => 'blue', -tags =>['circle'], -stipple => 'gray12', ); my $rect1 = $c->createRectangle(10,10,44,44, -fill => 'green', -stipple => 'gray12', -tags =>['rect1'], ); my $rect2 = $c->createRectangle(93,93,200,200, -fill => 'yellow', -tags =>['rect2'], -stipple => 'gray12', ); my $poly1 = $c->createPolygon(0,0, 44,44, 55,55, 90,90, 200,200, 10,10 +0,0,0, -fill => 'red', -smooth => 1, -splinesteps => 100, -stipple => 'gray12', -tags =>['poly1'], ); $c->Tk::bind("<Motion>", [ \&print_xy, Ev('x'), Ev('y') ]); &print_xy($c, 42, 42); MainLoop; sub print_xy { my ($canv, $x, $y) = @_; print "(x,y) = ", $canv->canvasx($x), ", ", $canv->canvasy($y), "\n"; #my $x1 = $x+1; #my $y1 = $y+1; #it will actually use a zero size rectangle my (@current) = $canv->find('overlapping', $x, $y, $x, $y); foreach my $id(@current){ print $canv->gettags($id),' '; } print "\n"; }

    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
Re: Speeding up point-in-polygon
by swampyankee (Parson) on Jul 24, 2006 at 19:33 UTC

    This is a well-studied problem, and your problem set many not even come into the category of "big" by the computational geometry mavens. Have you looked at PNPoly , Computational Geometry Algorithm Library or Graphic Gems? Another possibility would be to use a constrained Delaunay triangulation.

    Of more relevance to Perl, writing a Perl extension is one way to go. I think that using a specialized computational geometry library is likely to be more satisfying, though. Try CGAL.

    emc

    Login incorrect.
    Only perfect spellers may
    enter this system.

    Jason Axley

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://563130]
Approved by Joost
Front-paged by Old_Gray_Bear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others taking refuge in the Monastery: (13)
As of 2014-04-24 12:15 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (565 votes), past polls