in reply to
Re: Zipcode Proximity script

in thread Zipcode Proximity script

Here's a way to do all that prefiltering people keep mentioning:
First, some helper procedures:

`use constant EARTH_RADIUS => 3956;
use constant PI => 4 * atan2 1, 1;
sub make_quick_reject {
# Args: lat (radians), long (radians), dist
my($lat,$long,$dist) = @_;
my($distcos) = cos($dist / EARTH_RADIUS);
my($maxlatdiff) = atan2(sqrt(1 - ($distcos * $distcos)), $distcos);
if (abs($lat) + $maxlatdiff > PI/2) {
# special case - near poles
return sub { abs($lat - $_[0]) > $maxlatdiff; };
}
my($longmult);
$longmult = 1 / cos(abs($lat) + $maxlatdiff);
my($maxlongdiff) = $maxlatdiff * $longmult;
return sub { abs($lat - $_[0]) > $maxlatdiff or abs($long - $_[1]) >
+ $maxlongdiff; }
}
sub make_quick_accept {
# Args: lat (radians), long (radians), dist
my($lat,$long,$dist) = @_;
my($distcos) = cos($dist / EARTH_RADIUS);
my($maxlatdiff) = atan2(sqrt(1 - ($distcos * $distcos)), $distcos) /
+ sqrt(2);
my($longmult);
if (abs($lat) < $maxlatdiff ) {
# special case - near equator
$longmult = 1;
} else {
$longmult = 1 / cos(abs($lat) - $maxlatdiff);
}
my($maxlongdiff) = $maxlatdiff * $longmult;
return sub { abs($lat - $_[0]) < $maxlatdiff and
abs($long - $_[1]) < $maxlongdiff; }
}
`

Then, I assume that you've loaded your data into something like this:

`# zipcode, lat, long
@data = (['00210', 43.00589, 71.0132], ['00211', 43.00589, 71.0132], .
+..);
`

And then done something like this to it:

`map { $_->[1] *= PI/180; $_->[2] *= PI / 180;} @data;
`

Then you can do something like this:

`$distance = 25;
for $i (0..$#data) {
my $quick_reject_routine =
make_quick_reject($data[$i]->[1], $data[$i]->[2], $distance);
my $quick_accept_routine =
make_quick_accept($data[$i]->[1], $data[$i]->[2], $distance);
for $record (grep(! $quick_reject_routine->($_->[1],$_->[2]),
@data[$i+1..$#data])) {
if ($quick_accept_routine->($record->[1],$record->[2])
or
long_acceptance_routine($record, $data[$i], $distance) {
# write it to some file or other
}
}
}
`

Where for "long_calculation_routine", you substitute in essentially what you're doing now.
(or better yet, something analogous to the simpler C code suggested in the post I made
before logging in)
Extending this to the case of checking several distances at once shouldn't be hard - just
construct an array of quick reject and quick accept procedures, and then check through
them in order before trying the long acceptance routine.
Something like this:

`@distances = qw(5 10 25 50 100);
@distfiles = map { "0-$_.txt" } @distances;
for $i (0..$#data) {
my @quick_reject_routines =
map {
make_quick_reject($data[$i]->[1], $data[$i]->[2], $_);
}
@distances;
my @quick_accept_routines =
map {
make_quick_accept($data[$i]->[1], $data[$i]->[2], $_);
}
@distances;
OTHERCITY:
for $j ($i..$#data)) {
DISTANCE:
for $d (0..$#distances) {
if ($quick_reject_routines[$d]->($data[$j][1],$data[$j][2])) {
next DISTANCE;
}
if ($quick_accept_routines[$d]->($data[$j][1],$data[$j][2])
or
long_acceptance_routine($data[$j], $data[$i], $distance) {
map { write_csv_line($data[$i], $data[$j], $_);
write_csv_line($data[$j], $data[$i], $_); }
@distfiles[$d..$#distances];
next OTHERCITY;
}
}
}
}
`

Of course, this still contains extraneous calls to long_acceptance_routine, but my hope is
that the routines long_acceptance_routine depends on will somehow be caching recent results
(see the Memoize modules) and so there won't be the performance penalty.