comment on

A month or so ago I asked a question about speeding up point-in-poly tests Speeding up point-in-polygon. I received some excellent suggestions, and I have implemented one of them. Today, I am posting my solution here just in case someone can suggest methods for further gains in efficiency. To recap the problem --

210k polys, 5.25 million points.
For each point, update its attribute (name)) with that of the poly it falls within.
All points fall in one and only one poly.
All polys may not contain a point.

I built the following tables and indexes in SQLite 3.3.5 (I am using DBD::SQLite 1.12 on WinXP with ActiveState Perl 5.8.8)

    CREATE TABLE polys (
      id   INTEGER PRIMARY KEY,
      xmin REAL,
      ymin REAL,
      xmax REAL,
      ymax REAL,
      name TEXT,
      n    INTEGER,
      ar_x TEXT,
      ar_y TEXT
    )
    
    CREATE TABLE points (
      id   INTEGER PRIMARY KEY,
      x    REAL,
      y    REAL,
      name TEXT
    )
    
    CREATE INDEX ix_polys ON polys (xmin, ymin, xmax, ymax)
    
    CREATE INDEX ix_points ON points (x, y)
[download]

Then, using Geo::ShapeFile, I calculated the bounding box of each poly and update xmin, ymin, xmax, ymax in the polys table. I also unpacked each of the polys and updated polys.n with the number of points in the poly, and polys.ar_x and polys.ar_y with a list of all the x coords and y coords that make up the poly. I store the list of coords as a comma-separate list that I can convert back into an array using split(/,/, $ar_x) and so.

Then I loop over each poly in the table and select all the points that might potentially fall within it using the following query --

    SELECT py.name, py.n, py.ar_x, py.ar_y, pt.id, pt.x, pt.y 
    FROM polys py JOIN points pt ON 
      (py.xmin < pt.x AND py.ymin < pt.y AND py.xmax > pt.x AND py.yma
+x > pt.y) 
    WHERE py.id = ?
[download]

Since polys can be irregular, a given point may fall within its bounding box, but still not be within the poly. So I make a definitive check of that for every point selected in the above query against that poly, and if the is positive, I

    UPDATE points SET name = ? WHERE id IN (?)
[download]

I used the most excellent Devel::Profiler to find and fix the bottlenecks, and I believe I have made it as efficient as I can. Most of my process time is now being spent doing the point-in-poly, which I do as many times as there are points, in fact, even more in case a point ends up falling in more than one bounding-boxes. I use the following algorithm lifted from Mastering Algorithms With Perl aka the Wolf Book. (this is where I use polys.n, polys.ar_x, polys.ar_y)

sub _pointIsInPolygon {
  my ($a_point, $n, $a_x, $a_y) = @_;

  # point coords
  my ($x, $y) = ($a_point->[0], $a_point->[1]);
  
  # poly coords
  # $n is the number of points in polygon.      
  my @x = @$a_x; # Even indices: x-coordinates.
  my @y = @$a_y; # Odd indices: y-coordinates.
  
  my ($i, $j);                          # Indices.
  my $side = 0;                         # 0 = outside, 1 = inside.
  for ($i = 0, $j = $n - 1 ; $i < $n; $j = $i++) {
    if (
        (
          # If the y is between the (y-) borders ...
          (($y[$i] <= $y) && ($y < $y[$j])) ||
          (($y[$j] <= $y) && ($y < $y[$i]))
        )
        and
          # ...the (x,y) to infinity line crosses the edge
          # from the ith point to the jth point...
          ($x
            <
            ($x[$j] - $x[$i] ) *
            ($y - $y[$i]) / ($y[$j] - $y[$i]) + $x[$i] )) {
         $side = not $side; # Jump the fence.
    }
  }
  return $side ? 1 : 0;
}
[download]

Have been testing extensively. The latest test --

>perl test_point_in_poly.pl
5000.10000.15000.20000.25000.30000.35000.40000.45000.50000.55000.60000
+.65000.70000.75000.80000.85000.90000.95000.100000.
processed 100000 polys, updated 2948276 points 
Total time: 8451 wallclock secs (6340.54 usr + 200.52 sys = 6541.06 CP
+U)
[download]

The above screen dump shows 100,000 polys processed (about half of total), and almost 3 million points within them updated with the corresponding name (almost 60% of the points). Time taken is slightly less than 2.5 hours. Extrapolating, the entire set should take less than 5 hours. That is on my piddly Dell laptop which has now only 200 Mb of space left. This process is entirely CPU bound, so it will likely be way more quick on the production machine with its faster processor. There is prep work required before and after, but even with that, it should be in the range of about 5 hours or so.

Any further suggestions to squeeze further efficiencies would be most welcome.

--

when small people start casting long shadows, it is time to go to bed

In reply to Speeding up point-in-polygon -- take two by punkish

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.