Re: fastest way to compare numerical strings?
by pc88mxer (Vicar) on Jun 30, 2008 at 18:27 UTC

It's clear that you want to find nearby points in a two-dimensional space. One effective approach is to overlay a grid that groups the data into cells; then you only have to compare points from adjacent cells.
Some pseudocode:
my %CellMembers;
for my $i (0 .. $npoints - 1) {
    my $cellid = int($lat[$i]) . "," . int($long[$i]);
    push @{$CellMembers{$cellid}}, $i;
}
# iterate over cells
for my $ci (keys %CellMembers) {
    for my $cj (...cells within distance 1 of $ci...) {
        for my $i (@{$CellMembers{$ci}}) {
            for my $j (@{$CellMembers{$cj}}) {
                ...compare points $i and $j...
            }
        }
    }
}
Lots of nested loops, but probably a big time saver.

That is probably the best solution, so let's see some real (untested) code!
use POSIX qw(floor);

my @lat  = ...;
my @long = ...;
my $radius  = 0.01;
my $radius2 = $radius ** 2;

# assign each point to a grid cell of side $radius
my @flat  = map { floor($_ / $radius) } @lat;
my @flong = map { floor($_ / $radius) } @long;
my %cell;
for my $i (0 .. $#lat) {
    push @{$cell{$flat[$i], $flong[$i]}}, $i;   # multi-dim hash key via $;
}

# for each point, look only at its own cell and the 8 surrounding cells
for my $i (0 .. $#lat) {
    my @near;
    for my $dflat (-1, 0, 1) {
        my $flat = $flat[$i] + $dflat;
        for my $dflong (-1, 0, 1) {
            my $flong = $flong[$i] + $dflong;
            if (my $cell = $cell{$flat, $flong}) {
                for my $j (@$cell) {
                    next if $j == $i;   # don't report a point as near itself
                    if (($lat[$j] - $lat[$i]) ** 2 + ($long[$j] - $long[$i]) ** 2 < $radius2) {
                        push @near, $j;
                    }
                }
            }
        }
    }
    @near = sort { $a <=> $b } @near;   # numeric sort
    print "points near $i: @near\n";
}
Re: fastest way to compare numerical strings?
by psini (Deacon) on Jun 30, 2008 at 18:10 UTC

An obvious improvement would be to sort your array first. Then you only need to check each value against its neighbors, reducing the number of tests from N*N to N*M, where M is the average cluster population plus 1.
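A minimal, untested sketch of the sort-then-scan idea (the data and the 0.01 threshold are made up for illustration):

```perl
# Sort numerically, then for each value walk forward only while the
# gap to the next value stays under the threshold.
my @vals   = (1.02, 1.035, 6.01, 9, 1.011);
my $delta  = 0.01;
my @sorted = sort { $a <=> $b } @vals;
my @pairs;
for my $i (0 .. $#sorted - 1) {
    for (my $j = $i + 1;
         $j <= $#sorted && $sorted[$j] - $sorted[$i] < $delta;
         $j++) {
        push @pairs, [ $sorted[$i], $sorted[$j] ];
    }
}
print "close: $_->[0] and $_->[1]\n" for @pairs;
```

The inner loop stops as soon as the gap exceeds the threshold, which is where the N*M behavior comes from.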
Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."
Re: fastest way to compare numerical strings?
by grep (Monsignor) on Jun 30, 2008 at 18:12 UTC

Just a quick note:
Take the absolute value of the difference between the two numbers and check whether it is less than .01. That should read more cleanly and get rid of most of those checks.
Also, as a favor to the people trying to read your code: pick a style of indenting.
# untested
for ( $n = 0; $n <= $#lat; $n++ ) {
    for ( $v = 0; $v <= $#lat; $v++ ) {
        if ( abs( $lat[$n] - $lat[$v] ) < .01
             and abs( $long[$n] - $long[$v] ) < .01 ) {
            ## OTHER CODE
        }
    }
}
Update: fixed typo and minor edit
grep
One dead unjugged rabbit fish later...
Re: fastest way to compare numerical strings? (O(N) )
by BrowserUk (Pope) on Jun 30, 2008 at 18:58 UTC

Re: fastest way to compare numerical strings?
by jethro (Monsignor) on Jun 30, 2008 at 18:19 UTC

a) Sort the array, then just compare consecutive values. Sorting costs n*log(n) instead of the n^2 of pairwise comparison.
b) If you don't mind some numbers being reported twice, and the values are distributed somewhat evenly over a fixed range, you might separate them into buckets and only compare the numbers within each bucket. For example, if all numbers are between 0.000000 and 9.999999, you might generate 100 buckets, with the first bucket holding all the numbers between 0.0 and 0.11, the second between 0.1 and 0.21 (the buckets have to overlap to catch the edge cases between two buckets). The cost is n + (n/100)^2 * 100, which might in some cases be lower than that of the sort solution.
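A rough, untested sketch of the bucket idea. The bucket width and data are made up, and the overlap is implemented here by pushing each number into its home bucket and the bucket below it (a pair straddling a bucket edge is then still caught, at the cost of some duplicate reports, which are deduplicated via a hash):

```perl
my @nums  = (0.005, 0.002, 1.02, 1.025, 5.5);
my $width = 0.1;
my %bucket;
for my $n (@nums) {
    my $b = int($n / $width);
    push @{$bucket{$b}},     $n;   # home bucket
    push @{$bucket{$b - 1}}, $n;   # overlap with the bucket below
}
my %seen;   # dedup pairs reported from two buckets
for my $b (keys %bucket) {
    my @in = @{$bucket{$b}};
    for my $i (0 .. $#in - 1) {
        for my $j ($i + 1 .. $#in) {
            next unless abs($in[$i] - $in[$j]) < 0.01;
            my ($lo, $hi) = sort { $a <=> $b } ($in[$i], $in[$j]);
            $seen{"$lo,$hi"} = 1;
        }
    }
}
print "close: $_\n" for keys %seen;
```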
Re: fastest way to compare numerical strings?
by jds17 (Pilgrim) on Jun 30, 2008 at 19:14 UTC

In case not too many numbers cluster together around a few points, an alternative is to first form clusters of numbers that are at most 2 * $delta apart ($delta = 0.01 in your case) and then process only the candidate clusters to find numbers at most $delta apart.
The following code makes only one pass through the input array at the beginning, followed by sorts of small arrays ("small" provided the assumption made at the beginning holds).
# maximum distance we are looking for
$delta = 0.01;
# test array
@a = (1.02, 1.03, 6.01, 9, 1.04, 1.011, 1.025, 1.01, 0.005, 0.002);
# "discretize" points to neighboring buckets, scaling by 1/$delta
# to simplify computation
for (@a) {
    push @{$h{int($_/$delta)}},     $_;
    push @{$h{int($_/$delta - 1)}}, $_;
    push @{$h{int($_/$delta + 1)}}, $_;
}
# handle clusters
for (keys %h) {
    # if the corresponding array has more than one element, we know
    # it contains at least one pair no further apart than 2 * $delta;
    # otherwise ignore it
    if (@{$h{$_}} > 1) {
        @sorted = sort { $a <=> $b } @{$h{$_}};   # numeric sort
        for (0 .. @sorted - 2) {
            $r = $sorted[$_];
            $s = $sorted[$_ + 1];
            # filter out neighboring pairs; since we do not need to
            # process the numbers further, we cram them into a
            # string for the final output
            $near{"$r, $s"} = 1 if ($r < $s && $s <= $r + $delta);
        }
    }
}
print "Not further than $delta apart are the following pairs:\n";
print "$_\n" for (keys %near);
Output:
Not further than 0.01 apart are the following pairs:
0.002, 0.005
1.02, 1.025
1.025, 1.03
1.011, 1.02
1.03, 1.04
1.01, 1.011
Update: Tested using the Benchmark module and a fixed array of 300000 numbers randomly distributed between 0 and 300, the near-number-determining part took about 8 seconds on my reasonably modern machine. I.e., the benchmark test starts like this:
use strict;
use warnings;
use Benchmark;

# maximum distance we are looking for
my $delta = 0.01;
# test array
my @a;
for (1 .. 300000) {
    push @a, rand() * 300;
}
my ($r, $s);
timethis( 10 =>
    sub { ...
Update: Improved the description; it gave the impression we first look for all numbers in the array not more than 2 * $delta apart, which is not the case.
Re: fastest way to compare numerical strings?
by FunkyMonk (Chancellor) on Jun 30, 2008 at 18:56 UTC

my @lat  = map { rand( 180 ) - 90 } 1 .. 300_000;
my @long = map { rand( 180 ) - 90 } 1 .. @lat;

for my $lat ( 0 .. $#lat - 1 ) {
    if ( abs( $lat[$lat] - $lat[$lat+1] ) < 0.01 ) {
        for my $long ( 0 .. $#long - 1 ) {
            if ( abs( $long[$long] - $long[$long+1] ) < 0.01 ) {
                print "\$lat[$lat] = $lat[$lat] \$long[$long] = $long[$long]\n";
            }
        }
    }
}
Depending on the spread of your points, this may give a very worthwhile improvement. But if your points are all fairly close together anyway, it won't be.

Unfortunately this does not solve exactly the same problem because you are only checking to see if $lat[$i] and $lat[$i+1] are close. The OP wants to find all $i and $j where $lat[$i] is close to $lat[$j].
You can get closer to what the OP wanted if you first sorted the points based on $lat[...]. However, you still would want to check if $lat[$i+2], $lat[$i+3], etc. were close to $lat[$i].
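The sorted variant described here might look roughly like this (untested sketch; the coordinates are invented for illustration):

```perl
# Sort point indices by latitude, then scan forward from each point
# only while latitudes stay within the threshold; check longitude
# for the survivors.
my @lat  = (10.32, 10.321, 10.35);
my @long = (21.10, 21.105, 21.10);
my $eps  = 0.01;
my @ix   = sort { $lat[$a] <=> $lat[$b] } 0 .. $#lat;
my @pairs;
for my $p (0 .. $#ix - 1) {
    my $i = $ix[$p];
    for (my $q = $p + 1;
         $q <= $#ix && $lat[$ix[$q]] - $lat[$i] < $eps;
         $q++) {
        my $j = $ix[$q];
        push @pairs, [ $i, $j ] if abs($long[$j] - $long[$i]) < $eps;
    }
}
print "near pair: $_->[0] $_->[1]\n" for @pairs;
```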
Re: fastest way to compare numerical strings?
by pc88mxer (Vicar) on Jun 30, 2008 at 23:21 UTC

Let me suggest the non-Perl solution: have your database do it for you. The magic phrase you need to tell Google seems to be "geodistance", and searching for it turns up some interesting results.

If the OP will supply the data, I pretty much guarantee that this can be solved directly in Perl more quickly than you can load the data into MySQL and add a "geospatial index", let alone construct and run the query and then retrieve the results.
And with fewer lines of code.
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re: fastest way to compare numerical strings?
by apl (Monsignor) on Jun 30, 2008 at 18:20 UTC

Use your system's sort utility on the file first. Then you can compare adjacent records, reducing your number of tests to N - 1 rather than N**2.
Re: fastest way to compare numerical strings?
by Anonymous Monk on Jul 01, 2008 at 13:44 UTC

Is there a reason why your inner loop starts $v at 0?
Your loop seems to start by comparing 0 to 0, then to 1, then to 2... When $n increments, you compare 1 to 0, 1 to 1, etc. But aren't all the comparisons for $v <= $n ones that you've already done? There doesn't seem to be an obvious value in comparing 0 to 1 and then later comparing 1 to 0; they are, after all, both members of the same list.
Could you change the inner loop to start with $v = $n + 1 and reduce the number of comparisons significantly?
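For illustration, the triangular version of the loop examines each unordered pair exactly once, doing N*(N-1)/2 comparisons instead of N*N (toy data):

```perl
my @lat   = (1, 2, 3, 4);
my $count = 0;
for my $n (0 .. $#lat - 1) {
    # start at $n + 1 so each unordered pair is examined exactly once
    for my $v ($n + 1 .. $#lat) {
        $count++;
    }
}
print "$count comparisons for ", scalar(@lat), " points\n";
# prints "6 comparisons for 4 points"
```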
Re: fastest way to compare numerical strings?
by Anonymous Monk on Jul 02, 2008 at 01:15 UTC

Hi All,
Original poster / Anonymous (non)Monk here. Thanks for these great comments. I write code as the snail crawls. I am attempting to grok your advice and have already blown portions of my mind. Thank you. I am trying things out now and will report any solutions as they arise. A couple notes:
1) Yes, the PSQL/MySQL solution would be good, but for practical reasons it is less desirable right now. This code snippet is part of a long chain, so getting into SQL might happen later, but right now I need to complete the chain with the one hammer I somewhat know: Perl.
2) All lat/long values are relatively close together, within a few thousand square kilometers, just FYI...
3) Let's not have me share the data just yet; thanks for the offer though.
Thanks again and I'll post the results ASAP. It may or may not have anything to do with what you mention above, but the above certainly jogged the noodle which is helping.

Let's not have me share the data just yet
A representative sample (even if made up), or just the bounding box of the dataset and the precision of the points, would allow us to generate random data that roughly matches the problem space for testing possibilities.
Also, a brief description of the purpose of the code would help. For example, are you counting the points that lie within each cell of a 0.01 x 0.01 grid? Or are you trying to "cluster" the points regardless of any predefined gridding?
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Okay, I am happy with speed now. Winning answers were a combination of:
 the system 'sort' step mentioned by jethro and apl and maybe others (and searching only 1000 adjacent lines).
 the 'abs' command of grep and maybe others.
The rest of the stuff sounds fascinating and potentially useful. Thanks, truthfully, to everyone. Soon I hope to do this on a global daily dataset where n = 800 million or so. If anyone wants to hammer me on better methods and such, please feel free. Again, thanks to all.

Okey doke, while the tests run I'll type. Basically, the code is going through points representing fire detections by a NASA satellite (MODIS). Those points that fall close together (within .01) AND occur on consecutive days and at appropriate times ($numbdate and $timestamp) are written to a text file for review in other software (Arcview GIS) as part of ongoing research.
Latitude stats:
Minimum: 10.318842
Maximum: 14.124424
Mean: 11.24719
Standard Deviation: 0.805771
Longitude stats:
Minimum: 21.097507
Maximum: 24.912207
Mean: 22.474358
Standard Deviation: 0.974835
Thanks!



Re: fastest way to compare numerical strings?
by Sagacity (Monk) on Jul 04, 2008 at 16:58 UTC

Drawing from my experience developing a Do Not Call List discriminator, I too had thousands of comparisons to make between sets of numbers; each list had thousands of ten-digit numbers. In the beginning we were looking for a match, then took the steps of deciding what to do with it. It was slow! And then...
What I found impacted the speed the most was adopting elimination as the primary criterion in my comparisons. As the process of comparisons went along, each step had a valid reason either to continue comparing or, most importantly for speed, to stop comparing and either log the results or dump it and grab the NEXT line/entry/whatever.
So, the loop system will perform best when you have inserted the reasons for LAST and NEXT.
I acknowledge that your comparisons differ from what I was dealing with, but I am also sure that the results will improve if you can implement this thought process into your code.
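A hedged sketch of that elimination idea applied to a pairwise lat/long loop (the data is made up; the point is that `next` abandons a pair the moment one cheap test fails, so the second test only runs for survivors):

```perl
my @lat  = (10.32, 14.0, 10.321);
my @long = (21.10, 24.9, 21.105);
my @pairs;
for my $n (0 .. $#lat - 1) {
    for my $v ($n + 1 .. $#lat) {
        # cheapest rejection first: stop working on this pair as soon
        # as one coordinate is already too far away
        next if abs($lat[$n]  - $lat[$v])  >= 0.01;
        next if abs($long[$n] - $long[$v]) >= 0.01;
        push @pairs, [ $n, $v ];
    }
}
print "pair: $_->[0] $_->[1]\n" for @pairs;
```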
Re: fastest way to compare numerical strings?
by salva (Abbot) on Jul 07, 2008 at 09:15 UTC

I have just uploaded Algorithm::ClusterPoints to CPAN.
On my 3GHz PIV machine, it is able to find the clusters from a set of 300_000 points in around 25 seconds.