Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??
A month or so ago I asked a question about speeding up point-in-poly tests Speeding up point-in-polygon. I received some excellent suggestions, and I have implemented one of them. Today, I am posting my solution here just in case someone can suggest methods for further gains in efficiency. To recap the problem --
  1. 210k polys, 5.25 million points.
  2. For each point, update its attribute (name)) with that of the poly it falls within.
  3. All points fall in one and only one poly.
  4. All polys may not contain a point.
I built the following tables and indexes in SQLite 3.3.5 (I am using DBD::SQLite 1.12 on WinXP with ActiveState Perl 5.8.8)

CREATE TABLE polys ( id INTEGER PRIMARY KEY, xmin REAL, ymin REAL, xmax REAL, ymax REAL, name TEXT, n INTEGER, ar_x TEXT, ar_y TEXT ) CREATE TABLE points ( id INTEGER PRIMARY KEY, x REAL, y REAL, name TEXT ) CREATE INDEX ix_polys ON polys (xmin, ymin, xmax, ymax) CREATE INDEX ix_points ON points (x, y)

Then, using Geo::ShapeFile, I calculated the bounding box of each poly and update xmin, ymin, xmax, ymax in the polys table. I also unpacked each of the polys and updated polys.n with the number of points in the poly, and polys.ar_x and polys.ar_y with a list of all the x coords and y coords that make up the poly. I store the list of coords as a comma-separate list that I can convert back into an array using split(/,/, $ar_x) and so.

Then I loop over each poly in the table and select all the points that might potentially fall within it using the following query --

SELECT py.name, py.n, py.ar_x, py.ar_y, pt.id, pt.x, pt.y FROM polys py JOIN points pt ON (py.xmin < pt.x AND py.ymin < pt.y AND py.xmax > pt.x AND py.yma +x > pt.y) WHERE py.id = ?

Since polys can be irregular, a given point may fall within its bounding box, but still not be within the poly. So I make a definitive check of that for every point selected in the above query against that poly, and if the is positive, I

UPDATE points SET name = ? WHERE id IN (?)

I used the most excellent Devel::Profiler to find and fix the bottlenecks, and I believe I have made it as efficient as I can. Most of my process time is now being spent doing the point-in-poly, which I do as many times as there are points, in fact, even more in case a point ends up falling in more than one bounding-boxes. I use the following algorithm lifted from Mastering Algorithms With Perl aka the Wolf Book. (this is where I use polys.n, polys.ar_x, polys.ar_y)

sub _pointIsInPolygon { my ($a_point, $n, $a_x, $a_y) = @_; # point coords my ($x, $y) = ($a_point->[0], $a_point->[1]); # poly coords # $n is the number of points in polygon. my @x = @$a_x; # Even indices: x-coordinates. my @y = @$a_y; # Odd indices: y-coordinates. my ($i, $j); # Indices. my $side = 0; # 0 = outside, 1 = inside. for ($i = 0, $j = $n - 1 ; $i < $n; $j = $i++) { if ( ( # If the y is between the (y-) borders ... (($y[$i] <= $y) && ($y < $y[$j])) || (($y[$j] <= $y) && ($y < $y[$i])) ) and # ...the (x,y) to infinity line crosses the edge # from the ith point to the jth point... ($x < ($x[$j] - $x[$i] ) * ($y - $y[$i]) / ($y[$j] - $y[$i]) + $x[$i] )) { $side = not $side; # Jump the fence. } } return $side ? 1 : 0; }

Have been testing extensively. The latest test --

>perl test_point_in_poly.pl 5000.10000.15000.20000.25000.30000.35000.40000.45000.50000.55000.60000 +.65000.70000.75000.80000.85000.90000.95000.100000. processed 100000 polys, updated 2948276 points Total time: 8451 wallclock secs (6340.54 usr + 200.52 sys = 6541.06 CP +U)

The above screen dump shows 100,000 polys processed (about half of total), and almost 3 million points within them updated with the corresponding name (almost 60% of the points). Time taken is slightly less than 2.5 hours. Extrapolating, the entire set should take less than 5 hours. That is on my piddly Dell laptop which has now only 200 Mb of space left. This process is entirely CPU bound, so it will likely be way more quick on the production machine with its faster processor. There is prep work required before and after, but even with that, it should be in the range of about 5 hours or so.

Any further suggestions to squeeze further efficiencies would be most welcome.

--

when small people start casting long shadows, it is time to go to bed

In reply to Speeding up point-in-polygon -- take two by punkish

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others chilling in the Monastery: (6)
    As of 2014-08-23 07:48 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      The best computer themed movie is:











      Results (172 votes), past polls