PerlMonks  

fastest way to compare numerical strings?

by Anonymous Monk
on Jun 30, 2008 at 17:53 UTC ( #694790=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have an array of about 300,000 numbers (0.000000 format) and want to review each of those numbers to choose those which are within .01 of one another. So in essence this is roughly 300,000 x 300,000 comparisons I guess, which takes a while to calculate. I do this simply by loading an array and then comparing each value in that array with all other values by way of a loop within a loop. The problem is this takes about 10 hours or so to complete, at the least. Can any Monk recommend a coding solution that could speed this up? I tried using PDL, but that resulted in no noticeable speed increase. Relevant portions of the code are below...
for ($n = 0; $n <= $#lat; $n++) {
    for ($v = 0; $v <= $#lat; $v++) {
        if (   $lat[$n]  > $lat[$v]  - .01 && $lat[$n]  < $lat[$v]  + .01
            && $long[$n] > $long[$v] - .01 && $long[$n] < $long[$v] + .01) {
            if (   $numbdate[$n] == $numbdate[$v] + 1
                && $timestamp[$n] > 1400
                && $timestamp[$v] < 1400) {
                $e++;
            }
            if (   $numbdate[$n] == $numbdate[$v]
                && $timestamp[$n] < $timestamp[$v] - 400) {
                $d++;
            }
        }
    }
}

Replies are listed 'Best First'.
Re: fastest way to compare numerical strings?
by pc88mxer (Vicar) on Jun 30, 2008 at 18:27 UTC
    It's clear that you want to find nearby points in a 2-dimensional space. One effective way is to overlay a grid which groups the data into cells. Then you only have to compare points from adjacent cells. Some pseudo-code:
    my %CellMembers;
    for my $i (0..$npoints) {
        my $cellid = int($lat[$i]) . "," . int($long[$i]);
        push @{$CellMembers{$cellid}}, $i;
    }
    # iterate over cells
    for my $ci (keys %CellMembers) {
        for my $cj (...cells within distance 1 of $ci...) {
            for my $i (@{$CellMembers{$ci}}) {
                for my $j (@{$CellMembers{$cj}}) {
                    ...compare points $i and $j...
                }
            }
        }
    }
    Lots of nested loops, but probably a big time saver.
      that is probably the best solution, so let's see some real (untested) code!
      use POSIX qw(floor);

      my @lat  = ...;
      my @long = ...;

      my $radius  = 0.01;
      my $radius2 = $radius ** 2;

      my @flat  = map { floor($_ / $radius) } @lat;
      my @flong = map { floor($_ / $radius) } @long;

      my %cell;
      for my $i (0 .. $#lat) {
          push @{$cell{$flat[$i], $flong[$i]}}, $i;
      }

      for my $i (0 .. $#lat) {
          my @near;
          for my $dflat (-1, 0, 1) {
              my $flat = $flat[$i] + $dflat;
              for my $dflong (-1, 0, 1) {
                  my $flong = $flong[$i] + $dflong;
                  if (my $cell = $cell{$flat, $flong}) {
                      for my $j (@$cell) {
                          if ((($lat[$j]  - $lat[$i])  ** 2
                             + ($long[$j] - $long[$i]) ** 2) < $radius2) {
                              push @near, $j;
                          }
                      }
                  }
              }
          }
          @near = sort { $a <=> $b } @near;
          print "points near $i: @near\n";
      }
Re: fastest way to compare numerical strings?
by psini (Deacon) on Jun 30, 2008 at 18:10 UTC

    An obvious improvement could be to sort your array first. Then you can check each value against its neighbors only, reducing the tests from N*N to N*M, where M is the average cluster population plus 1.
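    A minimal, untested sketch of that idea (the @lat values here are invented for illustration): sort the indices by value, then for each point scan forward only until the gap exceeds the threshold.

    ```perl
    use strict;
    use warnings;

    # Invented sample values standing in for the real 300,000-element array.
    my @lat = (1.02, 1.005, 6.01, 1.011, 0.998);

    # Sort the indices by value so near-equal values become neighbours.
    my @order = sort { $lat[$a] <=> $lat[$b] } 0 .. $#lat;

    # For each point, scan forward only until the gap reaches .01.
    my @pairs;
    for my $i (0 .. $#order - 1) {
        for my $j ($i + 1 .. $#order) {
            last if $lat[$order[$j]] - $lat[$order[$i]] >= 0.01;
            push @pairs, [ $order[$i], $order[$j] ];
        }
    }
    printf "%d close pairs\n", scalar @pairs;
    ```

    The inner loop runs only about M times per point, which is where the N*M behaviour comes from.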

    Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."

Re: fastest way to compare numerical strings?
by grep (Monsignor) on Jun 30, 2008 at 18:12 UTC
    Just a quick note:

    Take the absolute value of the difference between the two numbers and check whether it is less than .01. That should read more cleanly and get rid of most of those checks.

    Also as a favor for the people trying to read your code - pick a style of indenting.

    # untested
    for ($n = 0; $n <= $#lat; $n++) {
        for ($v = 0; $v <= $#lat; $v++) {
            if (   abs( $lat[$n]  - $lat[$v]  ) < .01
                and abs( $long[$n] - $long[$v] ) < .01 ) {
                ## OTHER CODE
            }
        }
    }
    Update: fixed typo and minor edit
    grep
    One dead unjugged rabbit fish later...
Re: fastest way to compare numerical strings? (O(N) )
by BrowserUk (Pope) on Jun 30, 2008 at 18:58 UTC
Re: fastest way to compare numerical strings?
by jethro (Monsignor) on Jun 30, 2008 at 18:19 UTC
    a) Sort the array, then just compare consecutive values. Sorting costs n*log(n) instead of n^2.

    b) If you don't mind some numbers being reported twice, and the values are all somewhat equally distributed over a fixed range, you might separate them into buckets and only compare the numbers within each bucket. For example, if all numbers are between 0.000000 and 9.999999, you might generate 100 buckets, the first holding all numbers between 0.0 and 0.11, the second between 0.1 and 0.21 (the buckets have to overlap to catch the edge cases between two buckets). The cost is n + (n/100)^2 * 100, which might in some cases be lower than the sort solution.
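    An untested 1-D sketch of the bucket idea (sample values invented): each number goes into its own bucket and the overlapping neighbour bucket, and then only bucket-mates are compared.

    ```perl
    use strict;
    use warnings;

    # Invented sample data in a fixed 0..10 range.
    my @nums = (0.051, 0.105, 0.057, 5.300, 9.990);

    # Bucket width 0.1; each number also lands in the next bucket up
    # so pairs straddling a boundary still meet in some bucket.
    my %bucket;
    for my $x (@nums) {
        my $b = int($x * 10);
        push @{$bucket{$b}},     $x;
        push @{$bucket{$b + 1}}, $x;
    }

    # Compare only within buckets; dedup pairs seen in two buckets.
    my %seen;
    for my $key (keys %bucket) {
        my @vals = @{$bucket{$key}};
        for my $i (0 .. $#vals - 1) {
            for my $j ($i + 1 .. $#vals) {
                next unless abs($vals[$i] - $vals[$j]) < 0.01;
                $seen{ join ',', sort { $a <=> $b } $vals[$i], $vals[$j] } = 1;
            }
        }
    }
    print scalar(keys %seen), " close pair(s)\n";
    ```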

Re: fastest way to compare numerical strings?
by jds17 (Pilgrim) on Jun 30, 2008 at 19:14 UTC
    In case not too many numbers cluster together around a few points, an alternative is to first form clusters of numbers where the numbers are at most 2 * $delta apart ($delta = 0.01 in your case) and then only process the candidate clusters to find numbers at most $delta apart.

    The following code makes only one pass through the input array at the beginning and then only sorts small arrays ("small" as long as the assumption made at the beginning holds).

    # maximum distance we are looking for
    $delta = 0.01;

    # test array
    @a = (1.02, 1.03, 6.01, 9, 1.04, 1.011, 1.025, 1.01, 0.005, -0.002);

    # "discretize" points to neighboring points, scale by 1/$delta
    # to simplify computation
    for (@a) {
        push @{$h{int($_/$delta)}},     $_;
        push @{$h{int($_/$delta - 1)}}, $_;
        push @{$h{int($_/$delta + 1)}}, $_;
    }

    # handle clusters
    for (keys %h) {
        # in case the corresponding array has more than one element,
        # we know that it contains at least one pair not further
        # apart than 2 * $delta, otherwise ignore it
        if (@{$h{$_}} > 1) {
            @sorted = sort { $a <=> $b } @{$h{$_}};
            for (0 .. @sorted - 2) {
                $r = $sorted[$_];
                $s = $sorted[$_ + 1];
                # filter out neighboring pairs; since we do not need to
                # process the numbers further, we cram them into a
                # string for the final output
                $near{"$r, $s"} = 1 if ($r < $s && $s <= $r + $delta);
            }
        }
    }
    print "Not further than $delta apart are the following pairs:\n";
    print "$_\n" for (keys %near);
    Output:
    Not further than 0.01 apart are the following pairs:
    -0.002, 0.005
    1.02, 1.025
    1.025, 1.03
    1.011, 1.02
    1.03, 1.04
    1.01, 1.011
    Update: Tested using the Benchmark module and a fixed array of 300,000 numbers randomly distributed between 0 and 300; the near-number-determining part took about 8 seconds on my reasonably modern machine.

    I.e. the benchmark test starts like this:

    use strict;
    use warnings;
    use Benchmark;

    # maximum distance we are looking for
    my $delta = 0.01;

    # test array
    my @a;
    for (1 .. 300000) {
        push @a, rand() * 300;
    }

    my ($r, $s);
    timethis( 10 => sub { ...
    Update: Improved the description; it gave the impression we first look for all numbers in the array not more than 2 * $delta apart, which is not the case.
Re: fastest way to compare numerical strings?
by FunkyMonk (Chancellor) on Jun 30, 2008 at 18:56 UTC
    You may get a very substantial improvement by only checking the longitudes if the latitudes are close. You can do this easily by splitting your if into two.

    This runs in around 6 seconds here!

    my @lat  = map { rand( 180 ) - 90 } 1 .. 300_000;
    my @long = map { rand( 180 ) - 90 } 1 .. @lat;

    for my $lat ( 0 .. $#lat - 1 ) {
        if ( abs( $lat[$lat] - $lat[$lat+1] ) < 0.01 ) {
            for my $long ( 0 .. $#long - 1 ) {
                if ( abs( $long[$long] - $long[$long+1] ) < 0.01 ) {
                    print "\$lat[$lat] = $lat[$lat] \$long[$long] = $long[$long]\n";
                }
            }
        }
    }

    Depending on the spread of your points, this may give a very worthwhile improvement. But if your points are all fairly close anyway, it won't be worthwhile.


    Unless I state otherwise, all my code runs with strict and warnings
      Unfortunately this does not solve exactly the same problem because you are only checking to see if $lat[$i] and $lat[$i+1] are close. The OP wants to find all $i and $j where $lat[$i] is close to $lat[$j].

      You can get closer to what the OP wanted if you first sorted the points based on $lat[...]. However, you still would want to check if $lat[$i+2], $lat[$i+3], etc. were close to $lat[$i].

Re: fastest way to compare numerical strings?
by pc88mxer (Vicar) on Jun 30, 2008 at 23:21 UTC
    Let me suggest a non-Perl solution: have your database do it for you. The magic phrase you need to tell Google seems to be "geodistance", and it yields the following interesting results:

      If the OP will supply the data, I pretty much guarantee that this can be solved directly in Perl more quickly than you can load the data into MySql and add a "geospatial index", let alone construct and run the query and then retrieve the results.

      And with fewer lines of code.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        True, but there's always the possibility that the data is already (or wants to be) in a database.

        I'm just throwing out other options.

Re: fastest way to compare numerical strings?
by apl (Monsignor) on Jun 30, 2008 at 18:20 UTC
    Use your system's sort utility on the file first. Then you can compare adjacent records, reducing your number of tests to N - 1 rather than N**2.
Re: fastest way to compare numerical strings?
by Anonymous Monk on Jul 01, 2008 at 13:44 UTC
    Is there a reason why your inner loop starts $v at 0?

    Your loop seems to start by comparing 0 to 0, then to 1, then to 2... When $n increments, you compare 1 to 0, 1 to 1, etc. But aren't all the comparisons for $v <= $n ones that you've already done? There doesn't seem to be an obvious value in comparing 0 to 1 and then later comparing 1 to 0. They are, after all, both members of the same list.

    Could you change the inner loop to start with $v = $n + 1 and reduce the number of comparisons significantly?
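    An untested sketch of that change, with invented sample data; starting the inner loop at $n + 1 halves the work and skips the self-comparison:

    ```perl
    use strict;
    use warnings;

    # Invented sample coordinates for illustration only.
    my @lat  = (10.320, 10.324, 14.100);
    my @long = (21.100, 21.103, 24.900);

    my $pairs = 0;
    for my $n (0 .. $#lat - 1) {
        # Start at $n + 1: each unordered pair is examined exactly once.
        for my $v ($n + 1 .. $#lat) {
            next unless abs($lat[$n]  - $lat[$v])  < 0.01;
            next unless abs($long[$n] - $long[$v]) < 0.01;
            $pairs++;
        }
    }
    print "$pairs pair(s) within .01\n";
    ```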

Re: fastest way to compare numerical strings?
by Anonymous Monk on Jul 02, 2008 at 01:15 UTC
    Hi All,
    Original poster / Anonymous (non)Monk here. Thanks for these great comments. I write code as the snail crawls. I am attempting to grok your advice and have already blown portions of my mind. Thank you. I am trying things out now and will report any solutions as they arise. A couple notes:
    1) Yes, the PSQL/MySql solution would be good, but for practical reasons it is less desirable right now. This code snippet is part of a long chain, so getting into SQL might happen later, but right now I need to complete the chain with the one hammer I somewhat know: Perl.
    2) All lat/long values are relatively close together, within a few thousand square kilometers, just fyi...
    3) Let's not have me share the data just yet; thanks for the offer though.
    Thanks again and I'll post the results ASAP. It may or may not have anything to do with what you mention above, but the above certainly jogged the noodle which is helping.
      Let's not have me share the data just yet

      A representative sample (even if made up), or just the bounding box of the dataset and the precision of the points, would allow us to generate random data that roughly matched the problem space to test possibilities.

      Also, a brief description of the purpose of the code would help. For example, are you counting the points that lie within each cell of a 0.01 x 0.01 grid? Or are you trying to "cluster" the points regardless of any predefined gridding?


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Okay, I am happy with the speed now. The winning answers were a combination of:

      • the system 'sort' step mentioned by jethro and apl and maybe others (and searching only 1000 adjacent lines).

      • the 'abs' suggestion from grep, and maybe others.

        The rest of the stuff sounds fascinating and potentially useful. Thanks, truthfully, to everyone. Soon I hope to do this on a global daily dataset where n = 800 million or so. If anyone wants to hammer me on better methods and such, please feel free. Again, thanks to all.
        Okey doke, while the tests run I'll type. Basically, the code is going through points representing fire detections by a NASA satellite (MODIS). Those points that fall close together (within .01) AND occur on consecutive days and at appropriate times ($numbdate and $timestamp) are written to a text file for review in other software (Arcview GIS) as part of ongoing research.

        Latitude stats:
        Minimum: 10.318842
        Maximum: 14.124424
        Mean: 11.24719
        Standard Deviation: 0.805771


        Longitude stats:
        Minimum: 21.097507
        Maximum: 24.912207
        Mean: 22.474358
        Standard Deviation: 0.974835


        Thanks!
Re: fastest way to compare numerical strings?
by Sagacity (Monk) on Jul 04, 2008 at 16:58 UTC

    Drawing from my experience developing a Do Not Call List discriminator, I too had thousands of comparisons to make between sets of numbers; each list had thousands of ten-digit numbers. In the beginning we were looking for a match and then deciding what to do with it. It was slow! And then...

    What I found impacted speed the most was making elimination the primary criterion in my comparisons. At each step of the comparison process there was a valid reason to either continue comparing or, most importantly for speed, stop comparing and either log the result or dump it and grab the NEXT line/entry/whatever.

    So, the loop system will perform best when you have inserted the reasons for LAST and NEXT.

    I acknowledge that your comparisons differ from what I was dealing with, but I am also sure that the results will improve if you can implement this thought process into your code.
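    A small sketch of that elimination style, with invented data; the point is ordering the checks so a failing candidate is discarded as early as possible with next:

    ```perl
    use strict;
    use warnings;

    # Invented [lat, long] records and an invented reference point.
    my @records = ([10.320, 21.100], [10.800, 21.100], [10.321, 21.101]);
    my ($ref_lat, $ref_long) = (10.320, 21.100);

    my $kept = 0;
    for my $rec (@records) {
        my ($lat, $long) = @$rec;
        next if abs($lat - $ref_lat)   >= 0.01;  # cheap reject first
        next if abs($long - $ref_long) >= 0.01;  # reached only if lat is close
        $kept++;                                 # survivor: log it, etc.
    }
    print "$kept record(s) survived\n";
    ```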

Re: fastest way to compare numerical strings?
by salva (Abbot) on Jul 07, 2008 at 09:15 UTC
    I have just uploaded Algorithm::ClusterPoints to CPAN.

    On my 3GHz PIV machine, it is able to find the clusters from a set of 300_000 points in around 25 seconds.

Node Type: perlquestion [id://694790]
Approved by pc88mxer
Front-paged by salva