Re: Re: Zipcode Proximity script

in reply to Re: Zipcode Proximity script
in thread Zipcode Proximity script

Here's a way to do all that prefiltering people keep mentioning: First, some helper procedures:

use constant EARTH_RADIUS => 3956;
use constant PI           => 4 * atan2 1, 1;

sub make_quick_reject {
  # Args: lat (radians), long (radians), dist
  my($lat,$long,$dist) = @_;
  my($distcos) = cos($dist / EARTH_RADIUS);
  my($maxlatdiff) = atan2(sqrt(1 - ($distcos * $distcos)), $distcos);
  if (abs($lat) + $maxlatdiff > PI/2) {
    # special case - near poles
    return sub { abs($lat - $_[0]) > $maxlatdiff; };
  }
  my($longmult);
  $longmult = 1 / cos(abs($lat) + $maxlatdiff);
  my($maxlongdiff) = $maxlatdiff * $longmult;
  return sub { abs($lat - $_[0]) > $maxlatdiff or abs($long - $_[1]) >
+ $maxlongdiff; }
}

sub make_quick_accept {
  # Args: lat (radians), long (radians), dist
  my($lat,$long,$dist) = @_;
  my($distcos) = cos($dist / EARTH_RADIUS);
  my($maxlatdiff) = atan2(sqrt(1 - ($distcos * $distcos)), $distcos) /
+ sqrt(2);
  my($longmult);
  if (abs($lat) < $maxlatdiff ) {
    # special case - near equator
    $longmult = 1;
  } else {
    $longmult = 1 / cos(abs($lat) - $maxlatdiff);
  }
  my($maxlongdiff) = $maxlatdiff * $longmult;
  return sub { abs($lat - $_[0]) < $maxlatdiff and 
               abs($long - $_[1]) < $maxlongdiff; }
}
[download]

Then, I assume that you've loaded your data into something like this:

# zipcode, lat, long
@data = (['00210', 43.00589, 71.0132], ['00211', 43.00589, 71.0132], .
+..);
[download]

And then done something like this to it:

map { $_->[1] *= PI/180; $_->[2] *= PI / 180;} @data;
[download]

Then you can do something like this:

$distance = 25;

for $i (0..$#data) {
  my $quick_reject_routine = 
      make_quick_reject($data[$i]->[1], $data[$i]->[2], $distance);
  my $quick_accept_routine =
      make_quick_accept($data[$i]->[1], $data[$i]->[2], $distance);
  for $record (grep(! $quick_reject_routine->($_->[1],$_->[2]), 
                    @data[$i+1..$#data])) {
    if ($quick_accept_routine->($record->[1],$record->[2])
        or
        long_acceptance_routine($record, $data[$i], $distance) {
      # write it to some file or other
    }
  }
}
[download]

Where for "long_calculation_routine", you substitute in essentially what you're doing now. (or better yet, something analogous to the simpler C code suggested in the post I made before logging in) Extending this to the case of checking several distances at once shouldn't be hard - just construct an array of quick reject and quick accept procedures, and then check through them in order before trying the long acceptance routine. Something like this:

@distances = qw(5 10 25 50 100);
@distfiles = map { "0-$_.txt" } @distances;

for $i (0..$#data) {
  my @quick_reject_routines = 
      map {
        make_quick_reject($data[$i]->[1], $data[$i]->[2], $_);
      }
      @distances;
  my @quick_accept_routines =
      map {
        make_quick_accept($data[$i]->[1], $data[$i]->[2], $_);
      }
      @distances;
  OTHERCITY:
  for $j ($i..$#data)) {
    DISTANCE:
    for $d (0..$#distances) {
      if ($quick_reject_routines[$d]->($data[$j][1],$data[$j][2])) {
        next DISTANCE;
      }
      if ($quick_accept_routines[$d]->($data[$j][1],$data[$j][2])
          or
          long_acceptance_routine($data[$j], $data[$i], $distance) {
        map { write_csv_line($data[$i], $data[$j], $_);
              write_csv_line($data[$j], $data[$i], $_); } 
            @distfiles[$d..$#distances];
        next OTHERCITY;
      }
    }
  }
}
[download]

Of course, this still contains extraneous calls to long_acceptance_routine, but my hope is that the routines long_acceptance_routine depends on will somehow be caching recent results (see the Memoize modules) and so there won't be the performance penalty.

In Section Seekers of Perl Wisdom