PerlMonks  

Re: Fastest way to lookup a point in a set

by LanX (Saint)
on Aug 05, 2017 at 19:37 UTC [id://1196821]


in reply to Fastest way to lookup a point in a set

> I have a large set of points. ... I need to insert the set of points into a Perl data structure ... then perform many lookups to tell if a given point is in the set or not. I want the lookup to be as fast as possible.

How large is "large", how many is "many"?

I have the impression that it's rather a space problem; my approach would be to split the hash into a HoH to allow swapping of the second tier.

This works well if you can organize your look-ups (by ordering or caching) so that accesses to the second tier of hashes are bundled and you have a minimal amount of swapping. See also Re: write hash to disk after memory limit and Re: Small Hash a Gateway to Large Hash?

For instance, looking up $point{$x}{$y} would be fastest if you can bundle all look-ups to points ($x,*) for one fixed $x.
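A minimal sketch of that HoH idea with made-up points (not the OP's data): key the first tier by $x, the second by $y, and walk the look-ups grouped per $x so only one second-tier hash is hot at a time.

```perl
use strict;
use warnings;

# build a hash-of-hashes keyed by x, then y (sample points)
my %point;
for my $p ( [1, 2], [1, 5], [3, 4] ) {
    $point{ $p->[0] }{ $p->[1] } = 1;
}

# bundle look-ups per fixed $x: the inner loop only touches $point{$x}
for my $x ( sort { $a <=> $b } keys %point ) {
    for my $y ( sort { $a <=> $b } keys %{ $point{$x} } ) {
        print "($x,$y) is in the set\n" if exists $point{$x}{$y};
    }
}
```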

tl;dr maybe I'm missing the point...(?)

On a side note: are you aware of multi-dim hashes in Perl?

C:\Windows\system32>perl -e "$x=1;$y=2;$h{$x,$y}=4; print $h{$x,$y}"
4

If yes, why do you need to join the keys yourself?

> String: $x . ':' . $y

see perldata#Multi-dimensional-array-emulation
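Under the hood, a multi-dim subscript is just the keys joined with $; (default "\x1C"), so the two spellings below address the same hash slot. A minimal sketch, not from the thread:

```perl
use strict;
use warnings;

my %h;
my ( $x, $y ) = ( 1, 2 );
$h{ $x, $y } = 4;    # multi-dim subscript: key is join($;, $x, $y)

# the equivalent manual join with the subscript separator $;
print $h{ join( $;, $x, $y ) }, "\n";    # prints 4
```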

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!

Replies are listed 'Best First'.
Re^2: Fastest way to lookup a point in a set
by eyepopslikeamosquito (Archbishop) on Aug 05, 2017 at 23:30 UTC

    > On a side note: are you aware of multi-dim hashes in Perl?

    No, I wasn't aware of them. Curiously, when I added them to my original test, they turned out to be slower than join. That is, Multi:
    exists $cells->{$p->[0],$p->[1]} or die;
    was slower than Str:
    exists $cells->{ join ':', @{$p} } or die;
    Benchmark results:
    Str:   7 wallclock secs ( 6.58 usr + 0.00 sys = 6.58 CPU) @ 30404.38/s (n=200000)
    Multi: 8 wallclock secs ( 7.98 usr + 0.00 sys = 7.98 CPU) @ 25050.10/s (n=200000)

    Update: though Multi was faster than:

    exists $cells->{ join ':', $p->[0], $p->[1] } or die;

      > they turned out to be slower than join.

      Interesting. I think that's because Multi addresses the point's coordinates separately (two element fetches) while join can operate on the whole array (one list returned).

      Maybe try to see how it compares with join over 2 elements.

      I wouldn't be surprised if the results were equally fast. Multi-dimensional hashes are an old (Perl 4) but somewhat obscure feature; the implementation was probably never optimized beyond its equivalence to join.
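      That suggestion can be tried with a quick Benchmark sketch (hypothetical data and iteration count, not the OP's harness), comparing the multi-dim lookup against join over the array and join over the two elements:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# hypothetical point set -- the OP's real data is much larger
my @points = map { [ int rand 1000, int rand 1000 ] } 1 .. 1000;

my ( %cells, %mcells );
$cells{ join ':', @$_ } = 1 for @points;        # joined string keys
$mcells{ $_->[0], $_->[1] } = 1 for @points;    # multi-dim keys (joined with $;)

cmpthese( 500, {
    multi    => sub { exists $mcells{ $_->[0], $_->[1] }          or die for @points },
    join_arr => sub { exists $cells{ join ':', @$_ }              or die for @points },
    join_two => sub { exists $cells{ join ':', $_->[0], $_->[1] } or die for @points },
} );
```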

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

        There is also the issue of locales and the radix char when non-integer values are used with multidimensional hash keys.

        If one has coordinates

        [5,   5.5]
        [5.5, 5  ]

        then using the multidimensional hash approach in a locale where both $; and the radix char are a comma would result in ambiguous keys:

        "5,5,5"
        "5,5,5"

        I know the original post specifies signed ints, so I'm just being more generic here. I also haven't checked whether there is locale-specific behaviour of $;.
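        The ambiguity is easy to force without locales by putting the separator into the key data itself. A contrived sketch with $; set to a comma as a stand-in for the radix-char clash:

```perl
use strict;
use warnings;

my %h;
{
    local $; = ',';             # stand-in for a locale where the radix char is ","
    $h{ 5, '5,5' } = 'first';   # key string "5,5,5"
    $h{ '5,5', 5 } = 'second';  # also "5,5,5" -- the same slot!
}
print scalar keys %h, "\n";     # prints 1: two distinct "points", one key
```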

      Maybe stringification is faster
      use Data::Dump qw/pp dd/;
      local $" = $;;
      $p = [1,2];
      $h{"@$p"} = 12;
      $h{3,4}   = 34;
      dd %h;

      ("3\x1C4", 34, "1\x1C2", 12)

      no time to benchmark, sorry! (no pun intended ;-)

      update

      It's probably fastest to keep the points in that representation all the time.

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

        Update: Regarding performance, stringification "@$p" reaches the speed of join(':',@$p) in Perl 5.22 and onwards. See this post.

        Hi LanX,

        Added lan_hash and lan_look, based on str_hash and str_look respectively. The stringification is a nice catch. Surprisingly, it ran slower than join(':', @$p); I had expected it to be as fast or faster.

        sub lan_hash {
            # print "LanX_hash---------------\n";
            my %cells;

            # insert the points into the hash
            for my $p (@points) {
                my $h = $p->[0] . ' ' . $p->[1];
                my ( $x, $y ) = split ' ', $h;
                # print "x='$p->[0]' y='$p->[1]' h='$h' (x='$x' y='$y')\n";
                if ( $x != $p->[0] ) { die; }
                if ( $y != $p->[1] ) { die; }
                $cells{$h} = undef;    # ++$cells{$h};
            }
            scalar( keys %cells ) == $npoints or die;

            # look up each point in the hash
            for my $p (@points) {
                my $h = $p->[0] . ' ' . $p->[1];
                exists $cells{$h} or die;
            }
            exists $cells{'notfound'}  and die;
            exists $cells{'notfound2'} and die;
            exists $cells{'notfound3'} and die;
            return \%cells;
        }

        sub lan_look {
            my $cells = shift;
            for my $p (@points) {
                exists $cells->{"@$p"} or die;
            }
            exists $cells->{'notfound'}  and die;
            exists $cells->{'notfound2'} and die;
            exists $cells->{'notfound3'} and die;
        }

        my $big_ref = big_hash();
        my $lan_ref = lan_hash();
        my $mat_ref = mat_hash();
        my $pak_ref = pak_hash();
        my $st2_ref = str_hash();
        my $st3_ref = str_hash();
        my $str_ref = str_hash();

        timethese 200000, {
            Big => sub { big_look($big_ref) },    # $cells->{ ($p->[1] << 32) | ($p->[0] & 0xFFFFFFFF) }
            Lan => sub { lan_look($lan_ref) },    # $cells->{ "@$p" }
            Mat => sub { mat_look($mat_ref) },    # $cells->{ $_->[0] }{ $_->[1] }
            Pak => sub { pak_look($pak_ref) },    # $cells->{ pack "ii", $p->[0], $p->[1] }
            St2 => sub { st2_look($st2_ref) },    # $cells->{ join(':', @$p) }
            St3 => sub { st3_look($st3_ref) },    # $cells->{ $str }  # optimized
            Str => sub { str_look($str_ref) },    # $cells->{ $p->[0] .':'. $p->[1] }
        };

        Results from CentOS 7.3 VM running Perl 5.16.3.

        Regards, Mario

      You did not answer his question though. Can you batch the queries or do you need to run them one by one?

      If batching is allowed, then LanX has more or less revealed the solution. The problem is then very similar to sorting, and the fastest way to sort is probably the usual radix sort. (Maybe with AVX512 the balance has tipped in favor of merge sort; but I haven't tested that.)

      Once you have binned/sorted the queries, the lookup is essentially a linear pass over the (sorted) dataset. Don't loop over queries, map them.
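      A minimal sketch of that idea with made-up data: sort the queries once, then answer all of them in one merge-style pass over the sorted point set, advancing a single index instead of doing a fresh lookup per query.

```perl
use strict;
use warnings;

# sorted point set and sorted queries (hypothetical data)
my @set     = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] }
              ( [1, 2], [3, 4], [5, 6] );
my @queries = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] }
              ( [3, 4], [5, 5], [1, 2] );

my @hit;
my $i = 0;
for my $q (@queries) {
    # advance the set pointer while it is "less than" the query
    $i++ while $i < @set
        && ( $set[$i][0] <=> $q->[0] || $set[$i][1] <=> $q->[1] ) < 0;
    push @hit, ( $i < @set
        && $set[$i][0] == $q->[0]
        && $set[$i][1] == $q->[1] ) ? 1 : 0;
}
print "@hit\n";    # prints "1 1 0": (1,2) and (3,4) found, (5,5) not
```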

        Thanks for the tip. There is some scope for batching, I believe, and I hadn't thought of sorting. I'll write a new root node in the next day or so with details on the full problem.

        Update: the new root node: High Performance Game of Life
