Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Speeding up point-in-polygon -- take two

by punkish (Priest)
on Aug 28, 2006 at 04:16 UTC ( #569929=perlquestion: print w/ replies, xml ) Need Help??
punkish has asked for the wisdom of the Perl Monks concerning the following question:

A month or so ago I asked a question about speeding up point-in-poly tests Speeding up point-in-polygon. I received some excellent suggestions, and I have implemented one of them. Today, I am posting my solution here just in case someone can suggest methods for further gains in efficiency. To recap the problem --
  1. 210k polys, 5.25 million points.
  2. For each point, update its attribute (name)) with that of the poly it falls within.
  3. All points fall in one and only one poly.
  4. All polys may not contain a point.
I built the following tables and indexes in SQLite 3.3.5 (I am using DBD::SQLite 1.12 on WinXP with ActiveState Perl 5.8.8)

CREATE TABLE polys ( id INTEGER PRIMARY KEY, xmin REAL, ymin REAL, xmax REAL, ymax REAL, name TEXT, n INTEGER, ar_x TEXT, ar_y TEXT ) CREATE TABLE points ( id INTEGER PRIMARY KEY, x REAL, y REAL, name TEXT ) CREATE INDEX ix_polys ON polys (xmin, ymin, xmax, ymax) CREATE INDEX ix_points ON points (x, y)

Then, using Geo::ShapeFile, I calculated the bounding box of each poly and update xmin, ymin, xmax, ymax in the polys table. I also unpacked each of the polys and updated polys.n with the number of points in the poly, and polys.ar_x and polys.ar_y with a list of all the x coords and y coords that make up the poly. I store the list of coords as a comma-separate list that I can convert back into an array using split(/,/, $ar_x) and so.

Then I loop over each poly in the table and select all the points that might potentially fall within it using the following query --

SELECT py.name, py.n, py.ar_x, py.ar_y, pt.id, pt.x, pt.y FROM polys py JOIN points pt ON (py.xmin < pt.x AND py.ymin < pt.y AND py.xmax > pt.x AND py.yma +x > pt.y) WHERE py.id = ?

Since polys can be irregular, a given point may fall within its bounding box, but still not be within the poly. So I make a definitive check of that for every point selected in the above query against that poly, and if the is positive, I

UPDATE points SET name = ? WHERE id IN (?)

I used the most excellent Devel::Profiler to find and fix the bottlenecks, and I believe I have made it as efficient as I can. Most of my process time is now being spent doing the point-in-poly, which I do as many times as there are points, in fact, even more in case a point ends up falling in more than one bounding-boxes. I use the following algorithm lifted from Mastering Algorithms With Perl aka the Wolf Book. (this is where I use polys.n, polys.ar_x, polys.ar_y)

sub _pointIsInPolygon { my ($a_point, $n, $a_x, $a_y) = @_; # point coords my ($x, $y) = ($a_point->[0], $a_point->[1]); # poly coords # $n is the number of points in polygon. my @x = @$a_x; # Even indices: x-coordinates. my @y = @$a_y; # Odd indices: y-coordinates. my ($i, $j); # Indices. my $side = 0; # 0 = outside, 1 = inside. for ($i = 0, $j = $n - 1 ; $i < $n; $j = $i++) { if ( ( # If the y is between the (y-) borders ... (($y[$i] <= $y) && ($y < $y[$j])) || (($y[$j] <= $y) && ($y < $y[$i])) ) and # ...the (x,y) to infinity line crosses the edge # from the ith point to the jth point... ($x < ($x[$j] - $x[$i] ) * ($y - $y[$i]) / ($y[$j] - $y[$i]) + $x[$i] )) { $side = not $side; # Jump the fence. } } return $side ? 1 : 0; }

Have been testing extensively. The latest test --

>perl test_point_in_poly.pl 5000.10000.15000.20000.25000.30000.35000.40000.45000.50000.55000.60000 +.65000.70000.75000.80000.85000.90000.95000.100000. processed 100000 polys, updated 2948276 points Total time: 8451 wallclock secs (6340.54 usr + 200.52 sys = 6541.06 CP +U)

The above screen dump shows 100,000 polys processed (about half of total), and almost 3 million points within them updated with the corresponding name (almost 60% of the points). Time taken is slightly less than 2.5 hours. Extrapolating, the entire set should take less than 5 hours. That is on my piddly Dell laptop which has now only 200 Mb of space left. This process is entirely CPU bound, so it will likely be way more quick on the production machine with its faster processor. There is prep work required before and after, but even with that, it should be in the range of about 5 hours or so.

Any further suggestions to squeeze further efficiencies would be most welcome.

--

when small people start casting long shadows, it is time to go to bed

Comment on Speeding up point-in-polygon -- take two
Select or Download Code
Re: Speeding up point-in-polygon -- take two
by merlyn (Sage) on Aug 28, 2006 at 05:17 UTC
    Use the right tool for the right job. Postgres contains a built-in function that computes whether a geometric object contains another geometric object. Put your data into a Postgres database, and ask the right questions, and you'll be doing it as fast as possible.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      >Postgres contains a built-in function that 
      >computes whether a geometric object contains 
      >another geometric object.
      
      PostGres/PostGIS is can't be used because of project constraints (no complex db servers, etc.). Nevertheless, been there, done that. To really use the PostGIS functions (I am on that mailing list), I have to take account of other issues such as correct projections. Makes the task more complex than I need it to be. And, it is not an easily portable solution.

      In any case, my testing shows that speed is comparable to what I have above.

      --

      when small people start casting long shadows, it is time to go to bed
Re: Speeding up point-in-polygon -- take two
by BrowserUk (Pope) on Aug 28, 2006 at 11:49 UTC

    Could you post an example of your polygons; and state the bounds for the overall coordinate space please.

    I believe that there is a way to do this 1 or even 2 magnitudes faster.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      > Could you post an example of your polygons; and 
      > state the bounds for the overall coordinate space 
      > please.
      
      Think of zipcodes for the entire US... in lambert conformal projection, if that helps (although that doesn't matter, because both points and polys are in the same proj).
      --

      when small people start casting long shadows, it is time to go to bed

        I guess that I could go off, find a map of the US and work out the maximum coordinate space. I could then try and find a source of zipcode polygons somewhere on the web and so determine the precision with which the are marked.

        But presumably you already know what you are dealing with?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Speeding up point-in-polygon -- take two (Under 40 seconds!)
by BrowserUk (Pope) on Aug 28, 2006 at 13:22 UTC

    The following shows the output from several runs of a benchmark looking up 5.25 million randomly generated points in a coordinate space of 5000x5000 that contains 1/4 million polygons:

    c:\test>569929-b 1 trial of 5,25e6 points in 2.5e5 polys (33.094s total) c:\test>569929-b 1 trial of 5.25e6 points in 2.5e5 polys (36.656s total) c:\test>569929-b 1 trial of 5.25e6 points in 2.5e5 polys (33.250s total) c:\test>569929-b 1 trial of 5.25e6 points in 2.5e5 polys (35.094s total)

    The program runs in under 40 seconds and requires ~100MB. No database.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      > The following shows the output from several runs 
      > of a benchmark looking up 5.25 million randomly 
      > generated points in a coordinate space of 5000x5000 
      > that contains 1/4 million polygons:
      
      Ya but, are your polys regular? If so, that changes the whole game. It is because my polys are irregular that I have to do a definitive test of every point against all the vertices of every poly.

      Please post your code as it may really give me some great ideas.

      --

      when small people start casting long shadows, it is time to go to bed
        Ya but, are your polys regular?

        Using my method, this doesn't affect the performance.

        What does affect it is the size of the coordinate space and the minimum granularity of points.

        Update: I guess someone doesn't believe me ;)


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Speeding up point-in-polygon -- take two
by roboticus (Canon) on Aug 28, 2006 at 14:20 UTC
    If this is a task you're going to do repeatedly, and your polygon list is relatively static, then you should consider redefining the problem. I suspect that the loop and indexing operations my be consuming a considerable amount of your time. So maybe you can change the problem to unroll the loops.

    For example, instead of storing the polygons as they're defined, break the polygons into horizontal strips with with P[0]=lower-left, P[1]=upper-left, P[2]=upper-right, and P[3]=lower-right. In the case that the top or bottom edge is a point, you'll have the same thing, where P[1]==P[2] or P[3]==P[0], respectively. Then you'll insert these simpler polygons (quadrilaterals, nominally) into the database, so you can use a simpler test. Thus, we break this polygon:

    *--------------------* \ A / * *-----* / / \ \ / / \ \ / *--------* \ / *

    into these three:

    *--------------------* \ A / *--*-----*-------* / \ \ A(3)/ / A(2) \ \ / *--------* \ / *
    Since we have quadrilaterals (and triangles, sort-of), we can simplify your subroutine into something resembling:

    sub _pointIsInQuadrilateral { my ($a_point, $a_x, $a_y) = @_; # point coords my ($x, $y) = ($a_point->[0], $a_point->[1]); # poly coords # $n is the number of points in polygon. my @x = @$a_x; # Even indices: x-coordinates. my @y = @$a_y; # Odd indices: y-coordinates. if ($x < ($x[1]-$x[0]) * ($y-$y[0]) / ($y[1]-$y[0] + $x[0]) ) { # $x is left of the leftmost edge, so it's not in polygon return 0; } if ($x < ($x[3]-$x[2]) * ($y-$y[2]) / ($y[3]-$y[2] + $x[2]) ) { # $x is right of the rightmost edge, so it's not in polygon return 0; } # Point is inside quadrilateral! return 1; }

    As you can see, by redefining the problem and using quadrilaterals with parallel top and bottom edges, we get the following benefits:

  • We omit one of the parameters ($n)
  • We eliminate the Y test altogether
  • We also unroll the X coordinate tests to a simple pair of left-right tests
  • Your polygon structure in the database can be simpler (see also note below).

    So we're simplifying the code by increasing our polygon count. Even with the increased number of shapes in your table, I'm guessing that select statement will probably be about the same speed. Yes, there'll be more polygons, but they'll have smaller bounding boxes, so there'll be fewer false hits. I also think (without testing!) that _pointIsInQuadrilateral should be considerably faster.

    This is only one way you could redefine the problem. After studying, you might find a different way to redefine the problem to speed things up. If you try this approach, please let me know how it turns out!

    Note: One other advantage is that with a fixed number of points per polygon, you can store the point coordinates in table directly as numbers, rather than having a list of coordinates in text. Omitting the parsing of text to numbers may provide another speedup....

    --roboticus

      Another thought--

      If the aforementioned suggestion helps, then it may be possible to improve performance a bit more by adding even more polygons: You might break your polygons into rectangles, left triangles and right triangles, where a rectangle is 'definitely inside' a polygon, and the triangles are pasted to the left and right side of your set of rectangles to build the polygons you desire.

      With the rectangles, you can avoid the subroutine call altogether. If your select statement returns one, you have the answer "Yep, it's in polygon A". If it's a triangle, then you only have one edge to check. (For a left triangle, your point is inside the polygon of $x is to the right of the edge; and for a right triangle, your point is inside if $x is to the left of the edge.)

      By chewing on the problem, I'm certain you'll come up with a few other things to try...

      --roboticus

      Okay--Yet another thought (and then I promise I'll shut up for a while!).

      How important are exact results? If somewhat "fuzzy" edges are acceptable, you could use a quadtree, and just insert enough nodes to get the precision you want. Then you'd only need a select statement to determine whether your point is internal to the polygon.

      --roboticus

      Who intends to stop thinking about this question for at least an hour!

      If this is a task you're going to do repeatedly, and your polygon list is relatively static, then you should consider redefining the problem. I suspect that the loop and indexing operations my be consuming a considerable amount of your time. So maybe you can change the problem to unroll the loops.
      Yes, the polys are relatively static, although the points are not so. Nevertheless, I do indeed unroll both polys and points, and move them into the SQLite db. That way I avoid going back to the files repeatedly.

      You are correct about "loop and indexing operations my be consuming a considerable amount of your time". Below is the output from my latest run of 5000 polys using Devel::Profiler. More than 220k points were evaluated and updated, and my point-in-poly test was performed 220k times.

      Total Elapsed Time = 251.1784 Seconds            
        User+System Time = 189.3034 Seconds            
      Exclusive Times            
      %Time  ExclSec  CumulS  #Calls  sec/call    Csec/c    Name
      72.3   136.9    136.91  220726  0.0006      0.0006  main::_pointIsInPolygon
      9.09    17.21    17.216   5001  0.0034      0.0034  DBI::st::fetchall_arrayref
      6.25    11.82    11.825  13887  0.0009      0.0009  DBI::st::execute
      4.84     9.162  178.99       1  9.1622    178.99    main::updatePoints
      1.82     3.442    3.442   3885  0.0009      0.0009  DBI::db::commit
      0.16     0.298    0.298   5001  0.0001      0.0001  DBI::st::fetchrow_array
      0.07     0.141    0.141   3887  0           0       DBI::st::finish
      0.01     0.01     0.01       3  0.0033      0.0033  IO::Handle::new
      0        0        0.01       5  0           0.002   Geo::ShapeFile::get_bytes
      0        0        0          6  0           0       DBI::_new_handle
      0        0        0.01       5  0           0.002   Geo::ShapeFile::get_handle
      0        0        0          1  0           0       DBI::connect
      0        0        0          4  0           0       DBI::db::prepare
      0        0        0          4  0           0       DBI::_new_sth
      0        0        0          3  0           0       Geo::ShapeFile::dbf_handle  
      
      What I really have to figure out is to reduce those tests. I have an idea which I have to now figure out how to implement. Basically, it is like so --

      Once I get a bunch of potential points within the rect of a poly, I should find the 4 outermost points in each dimension. Once I determine those four definitively, all points within those 4 would be also within that poly, and I wouldn't have to do the checks for them.

      Now the hard part -- realize the above hypothesis.

      --

      when small people start casting long shadows, it is time to go to bed

        Once I get a bunch of potential points within the rect of a poly, I should find the 4 outermost points in each dimension. Once I determine those four definitively, all points within those 4 would be also within that poly, and I wouldn't have to do the checks for them.

        If I understand you correctly, then given the polygon below and the four points marked as dots ("."), wouldn't that method determine that the point marked with an "x" is inside the polygon?

        _________ | . ./ | / | /x | \ | \ | . .\ ---------

        If all of your polygons were guananteed to have internal angles > 90 degrees that might work, but you only said that they were "irregular". I had a similar idea yesterday (except mine was finding the maximum internal bounding rectangle for the polygon rather than using the query points themselves), and discarded it for this reason. :-)

        punkish:

        What I really have to figure out is to reduce those tests.
        That's why I made my original suggestion. There are two ways that breaking the polygons into simple shapes will help you:

        • With known shapes, your _pointIsInPolygon function is simpler, so you can use quick exits and/or the compiler can optimize it, and
        • you can increase your density of polygon vs. bounding boxes.
        In your original _pointIsInPolygon, you use variable indices into your vertex arrays, as well as adding a loop with it's inherent termination condition. By breaking your polygons into triangles or quadrilaterals, you can use fixed indices and omit the loop. That may significantly increase the subroutines speed. Also, your algorithm requires that you examine every edge in the polygon. By enforcing an order to the edges, you may be able to return early from the routine by taking advantage of that knowledge.

        Regarding the 'density of polygon vs. bounding boxes': A rectangle aligned with your X and Y axes has a 100% chance of containing a point returned by your SELECT statement. An arbitrary triangle has a 50% chance of containing a point in the bounding rectangle. So breaking your polygons into triangles and (aligned) rectangles will have a worst case coverage of 50%. And a higher coverage density directly correlates to a lower false-positive rate. (I.e., with a higher coverage, you'll need fewer tests because your SELECT will hit fewer bounding boxes.) With arbitrary polygons, you can have nearly *any* hit rate:

        *--------------------* *--------------------* *--------------------* | | \ / |*------------------*| | | \ / || || | ** \ / || || *-------------------* *-------------* ** ** ~99% ~90 ~32%

        While I still think my original suggestion would be interesting, you state:

        Yes, the polys are relatively static, although the points are not so. Nevertheless, I do indeed unroll both polys and points, and move them into the SQLite db. That way I avoid going back to the files repeatedly.
        By this, I'm assuming you mean that the points are in the same rough relative position in the polygons, meaning that (1) the points are connected in the same order, and (2) the edges will never cross. For example, the shape on the left could morph into the shape in the center, but not the one on the right:

        a----g g g / | / \ / \ / | / \ / \ b | == / \ != / \ \ e---f / \ / \ \ \ a-b f a c e c---d / / \/| / c---d / /\| / \ / / b / e d-----f
        In that case, breaking the polygons into X-aligned trapezoids (my original suggestion) may be a bit too restrictive. Perhaps using arbitrary quadrilaterals (or triangles) would give you (a) enough coverage density (i.e.just breaking the shape into quadrilaterals instead of trapezoids may simplify things enough. So you'd increase your coverage, and by breaking the polygons into triangles, you could still simplify your _pointIsInPolygon subroutine to eliminate the loops and variable indices.

        --roboticus

Re: Speeding up point-in-polygon -- take two
by hardburn (Abbot) on Aug 28, 2006 at 15:04 UTC

    Using a bounding box is a good start, but a bounding circle should be even better. A circle only needs to test one dimention:

    if( $circle->distance_from_center( $x, $y ) <= $circle->radius ) { # Point ($x, $y) is in circle }

    I've also found that dropping some of the calculations into C can speed it up quite a bit without being terribly complex to interface back to Perl.

    "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

Re: Speeding up point-in-polygon -- take two
by explorer (Chaplain) on Aug 28, 2006 at 16:35 UTC

    Unroll the condition and duplicate the process (or subroutine):

    if ( $y[i] <= $y ) { if ( $y < $y[$j] ) { # process ... } } elsif ( $y >= $y[$j] ) { # the same process .. }
    Or a simple calculation:
    if ( ( $y[$i] - $y ) * ( $y - $y[$j] ) > 0 ) { # process
    Remember: Memoize is your friend.
Re: Speeding up ... (O(N) determination of point in polygon regardless of complexity)
by BrowserUk (Pope) on Aug 28, 2006 at 20:05 UTC

    In Speeding up point-in-polygon -- take two, punkish was looking for ways of speeding up point in polygon calculations. Various ordering and searching methods come to mind which improve the performance by preselecting the polygons against which each point is compared. Combined with performing the math in C, these techniques can reasonably be expected to improve the performance by up to an order of magnitude.

    However, I learned long ago that the quickest way to do something is to avoid doing it at all. And in the case of computer math, if you can avoid doing any math at all, you'll definitely speed things up. A common technique for avoiding math, especially time expensive math is to trade memory for speed and use a lookup table. Eg. A lookup table of logarithms as those of use who used log tables at school will remember, is much faster than evaluating the polynomial expression required to calculate them.

    For the point in polygon problem of determining which zip-code a particular point falls in, the simplest way of doing this is to construct a 2-dimensional lookup table of the complex polygons representing each zip-code area and store the zip-code (or a numeric value that represents the zip-code), at the crossover of each pair of coordinates. Determining the zip-code from a point then becomes a simple task of retrieving the value stored at the point index by the (X,Y) pair of the point in question.

    My solution to this problem recognises that an image-file in which each of the polygons has been plotted, each using a unique colour, is simple a large scale map. It makes the lookup a simple process of loading the image and doing a getPixel( x,y ) of the relevant point. The color (index) returned is a direct representation of the zip-code at that point on the map.

    The nice thing about this process is that it does not matter how complex your polygons are, the lookup is still O(n) in the number of points to be looked up. Obviously, for a map the size of the US, some scaling is required to achieve the required accuracy.

    Note: Given accurate enough image plotting, this even caters for the case of a single street (red line in the purple area), having a different zip-code to the area surrounding it on all sides.

    Accuracy & scale

    Depending upon the required accuracy, a single image file sufficient to cover the entire USA, may be impractical to load into memory. In this case, it may be necessary to split the map into 4 or 16 ro 64 or some other convenient number of smaller images. You would then pre-order the points so as to minimise the need to switch images. Ie. You load the first image and only look up points that fall within it's boundaries before loading the second image.

    My proof-of-concept code consisted of looking up 5.25e6 randomly generated points in a single 5000x5000 image containing 2.5e5 polygons. It did this in around 35 seconds on my machine which makes it 50,000 times faster than the OP's method. Even if 256 separate images have to be loaded, it will still be at least two order of magnitude faster.

    For ease of generation, the 'polygons' I used were regularly spaced, equal-sized circles, but as noted earlier, once plotted, neither the number nor complexity of the polygons has any affect upon the performance of the method.

    Proof-of-concept code

    Image generator (polygon plotter)

    I've incremented the color used for each polygon by 64 (2563 / 250,000 = 67.1) in an attempt to produce a visually pleasing map, but in reality the map does not need to be visually pleasing so a standard by-1 increment could be used which would allow a 24-bit image to handle up to 16 million polygons with no affect on performance.

    #! perl -slw use strict; use GD; my $map = GD::Image->new( 5000, 5000, 1 ) or die $!; my $index = 0; my $bg = $map->colorAllocate( 0, 0, 0 ); $map->filledRectangle( 0, 0, 4999, 4999, $bg ); for my $y ( map{ $_ *10 +5 } 0 .. 499 ) { for my $x ( map{ $_ * 10 + 5 } 0 .. 499 ) { my $color = $map->colorAllocate( unpack 'xCCC', pack 'N', $ind +ex += 64 ); # print $color; $map->filledEllipse( $x, $y, 10, 10, $color ); } } open IMG, '>:raw', '569929.png' or die $!; print IMG $map->png; close IMG;

    Random point generator

    #! perl -slw use strict; open POINTS, '>', '569929.points' or die $!; printf POINTS "%d:%d\n", int( rand 5000 ), int( rand 5000 ) for 1 .. 5 +.25e6; close POINTS;

    Lookup process benchmark

    #! perl -slw use strict; use Benchmark::Timer; use GD; my $map = GD::Image->newFromPng( '569929.png', 1 ); open POINTS, '<', '569929.points' or die $!; my $T = new Benchmark::Timer; $T->start( '5.25e6 points in 2.5e5 polys' ); while( <POINTS> ) { chomp; my( $x, $y ) = split ':'; my $poly = $map->getPixel($x, $y ); # printf "Point ( %4.4d, %4.4d ) found in poly: %d\n", $x, $y, $pol +y / 64; } $T->stop( '5.25e6 points in 2.5e5 polys' ); $T->report; close POINTS;

    Timings of several runs

    c:\test>569929-b 1 trial of 5,25e6 points in 2.5e5 polys (33.094s total) c:\test>569929-b 1 trial of 5.25e6 points in 2.5e5 polys (36.656s total) c:\test>569929-b 1 trial of 5.25e6 points in 2.5e5 polys (33.250s total) c:\test>569929-b 1 trial of 5.25e6 points in 2.5e5 polys (35.094s total) c:\test>569929-b 1 trial of 5.25e6 points in 2.5e5 polys (33.734s total)

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I have been thinking about BrowserUk's proposed solution, and have concluded this to be one of the most innovative strategies I have seen in a long time. Using a png essentially as a lookup table... even though I haven't tried the solution yet, I have to say "wow! Bravo!"

      Well, I have been thinking about it, and there are a few quibbles, and possible mods --

      As noted in the solution, a single image to cover the entire country would be simply too big, and would have to be chopped up. That creates problem of juggling images in memory, as there is really no easy way to sort the points and process them in image-based batches.

      The pixels in the image can be mapped to the smallest unit of the coordinate, but fractional units for a point like 1.24435, 455.22452 get all fudged up.

      If I could figure out how to generate all the pixel coords for a polygon, I wouldn't even need GD. I could simply store each pixel (something akin to an INTEGER UNIQUE NOT NULL would work just fine) and its associated attribute in a Berkeley Db hash. The advantage of Bdb would be that I wouldn't have to worry about image file sizes and chunking up images as in the GD approach.

      So, does anyone know of a way to generate all the coord pairs for a given polygon and a resolution?

      By the way, I tested my approach posted in the OP that started this thread. Not including the data prep, it took me about 3.25 hours to update 5.25 million points against 210k polys. Nevertheless, compliments to BrowserUk for a truly innovative solution.

      --

      when small people start casting long shadows, it is time to go to bed

        Converting a polygon to pixels is something that graphics libraries are highly optimized at. So I'd go with the graphics library approach.

        It may be as simple as drawing a border around each polygon in white. If your lookup finds a white pixel, then look at nearby pixels to find the colors of polygons to test the 'long' way. Then you don't have to have a really high-resolution image.

        If you've got very thin strips, then that may require special handling. Assigning one more color to each polygon that has such should cover it, though (that color means try that polygon but also try polygons represented by the color of nearby pixels).

        An alternate approach would simply be to draw the polygons without borders (being sure to draw any really thin strips last) and always look at the 9 closest pixels instead of just 1 pixel.

        - tye        

        as there is really no easy way to sort the points and process them in image-based batches.

        Assuming your zipcode polygons are obtained from somewhere like here, the each of the .e00 files contains the polygons for the zipcodes in each state. By building one image per state you will have 48/50 images and 48/50 bounding boxes. Pre-selecting or ordering the points by those bounding boxes (much as you currently do using the bounding box of the individual polys0, is simple, and as there are many fewer states than zips, much quicker.

        If the points originate in a DB, then something like

        select x,y from points where x >= state.minx and x <= state.maxx and y >= state.miny and y <= state.maxy;

        would allow you to load the state images one at a time and select only the points (roughly) within state for lookup.

        The pixels in the image can be mapped to the smallest unit of the coordinate, but fractional units for a point like 1.24435, 455.22452 get all fudged up.

        Taking Texas as an example (guessing that it is the largest state from looking at a map), width covers 13 degrees of Longitude and 773 miles. That means that each arc minute represents ~1 mile and each arc second ~88 feet. In decimal degrees, each 0.0001 of a degree represents ~32 feet. (Check my math, it's getting late!)

        So, if you represent the biggest state by an image 13000x11000 pixels, and multiply all the coordinates in your polys by 10,000 and truncate, each pixels will represent ~32ft X 32 feet. The image takes under 600MB when loaded in memory. If you have fully populated hardware, you could perhaps increase your resolution by a factor of 10 if need be. Remember that the majority of states are considerably smaller.

        There will obviously be some overlap between bounding boxes, which means some points will be lookup in 2 or 3 maps, but it is convenient that in large part, most state boundaries follow lines of longitude and latitude.

        There is even a module Geo::E00 for processing these files :)


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Speeding up point-in-polygon -- take two
by eric256 (Parson) on Aug 30, 2006 at 16:45 UTC

    3d renderers do this all the time and normaly pretty darn fast. Just polygons + a raycast to determine which polygon you hit. Why not break down the polygon into triagles and then sort those with a b-tree? Isn't that still how 3d games handle the process? Alternatly just use a 3d engine and load all the polygons in and raytrace ;) but that seems silly.


    ___________
    Eric Hodges
      Years later... I don't think that the ray projection will work without a method to see if the ray passes between two points that may make up the vertices of the polygon. - lartnec

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://569929]
Approved by imp
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others taking refuge in the Monastery: (6)
As of 2014-07-26 08:29 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (175 votes), past polls