|Perl: the Markov chain saw|
Speeding up point-in-polygon -- take twoby punkish (Priest)
|on Aug 28, 2006 at 04:16 UTC||Need Help??|
punkish has asked for the
wisdom of the Perl Monks concerning the following question:
A month or so ago I asked a question about speeding up point-in-poly tests Speeding up point-in-polygon. I received some excellent suggestions, and I have implemented one of them. Today, I am posting my solution here just in case someone can suggest methods for further gains in efficiency. To recap the problem --
Then, using Geo::ShapeFile, I calculated the bounding box of each poly and update xmin, ymin, xmax, ymax in the polys table. I also unpacked each of the polys and updated polys.n with the number of points in the poly, and polys.ar_x and polys.ar_y with a list of all the x coords and y coords that make up the poly. I store the list of coords as a comma-separate list that I can convert back into an array using split(/,/, $ar_x) and so.
Then I loop over each poly in the table and select all the points that might potentially fall within it using the following query --
Since polys can be irregular, a given point may fall within its bounding box, but still not be within the poly. So I make a definitive check of that for every point selected in the above query against that poly, and if the is positive, I
I used the most excellent Devel::Profiler to find and fix the bottlenecks, and I believe I have made it as efficient as I can. Most of my process time is now being spent doing the point-in-poly, which I do as many times as there are points, in fact, even more in case a point ends up falling in more than one bounding-boxes. I use the following algorithm lifted from Mastering Algorithms With Perl aka the Wolf Book. (this is where I use polys.n, polys.ar_x, polys.ar_y)
Have been testing extensively. The latest test --
The above screen dump shows 100,000 polys processed (about half of total), and almost 3 million points within them updated with the corresponding name (almost 60% of the points). Time taken is slightly less than 2.5 hours. Extrapolating, the entire set should take less than 5 hours. That is on my piddly Dell laptop which has now only 200 Mb of space left. This process is entirely CPU bound, so it will likely be way more quick on the production machine with its faster processor. There is prep work required before and after, but even with that, it should be in the range of about 5 hours or so.
Any further suggestions to squeeze further efficiencies would be most welcome.
when small people start casting long shadows, it is time to go to bed