http://www.perlmonks.org?node_id=884064

Just in has asked for the wisdom of the Perl Monks concerning the following question:

Is there a way (possibly a module) that is able to look up the most closest numeric key? i.e. given this hash:

my %distances = ( 452 => 'London', 678 => 'Paris', 890 => 'Rome', );

would it be possible to get the value Paris by looking up the non existent key of 672?

I'm able to do this with code that calculates each distance and then sorts the resulting distances. But it just seems a waste to have to calculate each individual distance of the hash for each key that I supply. I assume there's some big O notation that describes this.

Just in

Replies are listed 'Best First'.
Re: Lookup closest hash key
by GrandFather (Saint) on Jan 25, 2011 at 06:37 UTC

    Hashes don't store keys in a way that facilitates a direct look up of that sort. The simplest option is likely to be to sort the keys then perform a binary search. If you need to perform the lookup more than once the sorted array of keys could be saved to save a little time for subsequent searches.

    It may be that you would be better off using an array of [distance, city] pairs and either create the array sorted, or sort it once after creation. If you can't use the hash keys directly for lookup there is no advantage to using a hash at all and performing other tasks are likely to be more cumbersome so a sorted array is likely to work much better for you.

    True laziness is hard work
Re: Lookup closest hash key
by tilly (Archbishop) on Jan 25, 2011 at 07:19 UTC
    A hash is the wrong data structure to use. Instead you want an ordered data structure, like a BTree. If you want to know how one works internally, you can read BTrees in Perl.

    You don't need to roll your own. There is an excellent implementation called BerkeleyDB, which you can tie a hash to. It should be possible to do this using BerkeleyDB, with a custom sort operator and cursors.

    However I know it is possible to do this with the older DB_File. Read Matching Partial Keys which can look up the key and possibly get the next after. Then using the R_PREV flag with another seq should get you the previous key/value pair. And you can compare those two against the original to figure out which is closest.

    Note that BTrees default to lexicographic sort order, but you can pass a custom sort function. This is important if you need to avoid putting keys in the order 8, 89, 899, 9, 91, ...

Re: Lookup closest hash key
by Corion (Patriarch) on Jan 25, 2011 at 08:11 UTC

    Steffen Müller has been working on Algorithm::SpatialIndex, which creates (for example) a quadtree, a structure that can easily tell you what rectangle a point falls in. I'm not sure whether you can massage your data to cover the area with rectangles, but maybe that works or works well enough. Alternatively, maybe one of the Voronoi modules helps you to create the areas where a point is "nearest".

      The OP is only interested in nearest in a one-dimensional space. No need for quadtrees or Voronoi diagrams. Depending how often the OP needs to query his structure, I'd either keep the numbers sorted, or just do a linear pass.
        The OP is only interested in nearest in a one-dimensional space.
        Actually, I don't think he is. This looks like an XY problem to me, and I think Corion thought the same thing...

        Distances between cities is intrinsically a 2-dimensional problem.

        We don't know what the rest of the OP's code does, but I assume it's something that could be better integrated with a solution to the question that was asked.

Re: Lookup closest hash key
by LanX (Saint) on Jan 25, 2011 at 09:28 UTC
    And what if two cities have the same distance???

    Please don't use this hash, or at least invert it with reverse!

    I agree with Grandfather, use a data structure prepared for optimizing many lookups, like a sorted array. If the logarithmic O of a simple binary search is still to expensive you can still try niftier things like weighted searches, nested lookup tables for intervals or trees.

    Cheers Rolf

Re: Lookup closest hash key
by The Perlman (Scribe) on Jan 25, 2011 at 11:03 UTC
    my %distances = ( 452 => 'London', 678 => 'Paris', 890 => 'Rome', );
    > would it be possible to get the value Paris by looking up the non existent key of 672?

    Sure, add the non-existent key 672 pointing to Paris as nearest neighbor! :)

    Seriously, a one kilometer grid of distances on earth only means precalculating at most 20000 entries (but better use an array then).

    You can also go for a compromise and add a wider grid to the hash, like e.g. 10 kilometers. (The optimal width depends on your data).

    Now rounding 672 to 670 and 680 will lead you to precalculated hash entries pointing to Paris. Can't be much faster...

Re: Lookup closest hash key
by nif (Sexton) on Jan 25, 2011 at 16:20 UTC

    Why not to code B-Tree yourself?

    #!/usr/bin/perl use strict; use warnings; my %distances = ( 100 => 'London', 200 => 'Paris', 300 => 'Rome', 600 => 'Berlin' ); my @distances = sort {$a<=>$b} keys %distances; sub make_btree { my $first = shift; my $nof = shift; if ( $nof == 1 ) { return { Value => $distances[$first] }, } else { my $m = int($nof/2); my $val = ( $distances[$first+$m-1] + $distances[$first+$m] )/ +2; return { Value => $val, Left => make_btree($first, $m), Right => make_btree($first+$m, $nof-$m), } } } my $BTREE = make_btree(0, scalar @distances); #use Data::Dumper; #print Dumper( $BTREE ); sub closest_city { my $key = shift; my $btree = $BTREE; while ( exists $btree->{Left} ) { my $v = $btree->{Value}; $btree = $btree->{ $key<$v ? 'Left' : 'Right' } } $distances{ $btree->{Value} }; } print q{closest_city(5)="}, closest_city(5), qq{"\n}; print q{closest_city(149)="}, closest_city(149), qq{"\n}; print q{closest_city(150)="}, closest_city(150), qq{"\n}; print q{closest_city(203)="}, closest_city(203), qq{"\n}; print q{closest_city(290)="}, closest_city(290), qq{"\n}; print q{closest_city(500)="}, closest_city(500), qq{"\n}; print q{closest_city(5000)="}, closest_city(5000), qq{"\n};
    Output:
    closest_city(5)="London" closest_city(149)="London" closest_city(150)="Paris" closest_city(203)="Paris" closest_city(290)="Rome" closest_city(500)="Berlin" closest_city(5000)="Berlin"
    Following B-Tree will be generated in this example:
    250 / \ 150 450 / | | \ 100 200 300 600 | | | | London Paris Rome Berlin

      That's awesome! Thanks to you and Simon. Now I've got more ideas and code to benchmark.

Re: Lookup closest hash key
by BrowserUk (Patriarch) on Jan 25, 2011 at 14:04 UTC

    How many cities? And across what range (Ie. minimum and maximum distance)?

    Depending upon your answers to those questions, there is probably an O(1) solution.

      I have possibly over simplified my concern. But your replies negate the need for a hash, which confirms my suspicions. The module pointers are excellent, their reading will prove useful for similar problems.

      That said Tie::RangeHash looks very useful, and I'll do some benchmarking to see if it's worth while.

      The hash implementation I used should work for the time being, however I suspect that the actual data source will grow in magnitudes making db interaction and SQL statements more attractive. Again more benchmarking.

      My most heartiest thanks for the pointers and replies.

Re: Lookup closest hash key
by Anonymous Monk on Jan 25, 2011 at 14:55 UTC

    For a fixed and possibly large number of entries, a binary search might be theoretically fastest.   But, in these modern times, I would be hard pressed not to turn to trusty SQLite... a public domain(!) flat file SQL database of known very good performance.   All things considered, this is probably the best-all-around manageable strategy, with a rich choice of third-party tools available for maintaining the data.

Re: Lookup closest hash key
by anonymized user 468275 (Curate) on Jan 25, 2011 at 16:49 UTC
    Nah, the whole point of a hash is that it obviates coding the B-tree yourself
    my %distances = ( 452 => 'London', 678 => 'Paris', 890 => 'Rome', ); print nearest( 672, \%distances ) . "\n"; sub nearest{ my ( $dist, $href ) = @_; my ( $answer ) = ( sort { abs( $a - $dist ) <=> abs( $b - $dist ) +} keys %$href ); return $href -> { $answer }; }
    update: you could use the orcish manoeuvre to minimise sort iterations but something as simple as abs( x - y ) seems too cheap to be worth it.

    One world, one people

      Didn't you mean this, rather?
      sub nearest{ my ( $dist, $href ) = @_; my ( $answer ) = ( sort { abs( $a - $dist ) <=> abs( $b - $dist ) +} keys %$href )[0]; return $href -> { $answer }; }
      Otherwise this is a great trick, thanks!
        my ($answer) creates list context, so the [0] at the end is not needed: it is a list assignment, the first element on the left gets the first element on the right. With the index at the end, you can drop the parentheses on the left, as it is a scalar assignment:
        my $answer = ( sort { abs( $a - $dist ) <=> abs( $b - $dist ) } keys % +$href )[0];
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ