PerlMonks  

Re: Fastest way to lookup a point in a set

by LanX (Saint)
on Aug 05, 2017 at 19:37 UTC [id://1196821]


in reply to Fastest way to lookup a point in a set

> I have a large set of points. ... I need to insert the set of points into a Perl data structure ... then perform many lookups to tell if a given point is in the set or not. I want the lookup to be as fast as possible.

How large is "large", how many is "many"?

I have the impression that it's rather a space problem; my approach would be to split the hash into a HoH to allow swapping of the second tier.

This works well if you can organize your look-ups (by ordering or caching) so that accesses to the second tier of hashes are bundled and you have a minimal amount of swapping. See also Re: write hash to disk after memory limit and Re: Small Hash a Gateway to Large Hash?

For instance, looking up $point{$x}{$y} would be fastest if you can bundle all look-ups to points ($x,*) for one fixed $x.
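A minimal sketch of that HoH idea with made-up points (not the OP's data): key the first tier by $x, the second by $y, and walk the look-ups grouped per $x so only one second-tier hash is hot at a time.

```perl
use strict;
use warnings;

# build a hash-of-hashes keyed by x, then y (sample points)
my %point;
for my $p ( [1, 2], [1, 5], [3, 4] ) {
    $point{ $p->[0] }{ $p->[1] } = 1;
}

# bundle look-ups per fixed $x: the inner loop only touches $point{$x}
for my $x ( sort { $a <=> $b } keys %point ) {
    for my $y ( sort { $a <=> $b } keys %{ $point{$x} } ) {
        print "($x,$y) is in the set\n" if exists $point{$x}{$y};
    }
}
```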

tl;dr maybe I'm missing the point...(?)

On a side note: are you aware of multi-dim hashes in Perl?

C:\Windows\system32>perl -e "$x=1;$y=2;$h{$x,$y}=4; print $h{$x,$y}"
4

If yes, why do you need to join the keys yourself?

> String: $x . ':' . $y

see perldata#Multi-dimensional-array-emulation
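Under the hood, a multi-dim subscript is just the keys joined with $; (default "\x1C"), so the two spellings below address the same hash slot. A minimal sketch, not from the thread:

```perl
use strict;
use warnings;

my %h;
my ( $x, $y ) = ( 1, 2 );
$h{ $x, $y } = 4;    # multi-dim subscript: key is join($;, $x, $y)

# the equivalent manual join with the subscript separator $;
print $h{ join( $;, $x, $y ) }, "\n";    # prints 4
```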

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!

Replies are listed 'Best First'.
Re^2: Fastest way to lookup a point in a set
by eyepopslikeamosquito (Archbishop) on Aug 05, 2017 at 23:30 UTC

    > On a side note: are you aware of multi-dim hashes in Perl?

    No, I wasn't aware of them. Curiously, when I added them to my original test, they turned out to be slower than join. That is, Multi:
    exists $cells->{$p->[0],$p->[1]} or die;
    was slower than Str:
    exists $cells->{ join ':', @{$p} } or die;
    Benchmark results:
    Str:   7 wallclock secs ( 6.58 usr + 0.00 sys = 6.58 CPU) @ 30404.38/s (n=200000)
    Multi: 8 wallclock secs ( 7.98 usr + 0.00 sys = 7.98 CPU) @ 25050.10/s (n=200000)

    Update: though Multi was faster than:

    exists $cells->{ join ':', $p->[0], $p->[1] } or die;

      > they turned out to be slower than join.

      Interesting. I think that's because Multi addresses the point's coordinates separately (two element fetches) while join can operate on the whole array (one list returned).

      Maybe try to see how it compares with join over 2 elements.

      I wouldn't be surprised if the results were equally fast. Multi-dimensional hashes are an old (Perl 4) but somewhat obscure feature; the implementation was probably never optimized beyond its equivalence to join.
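      That suggestion can be tried with a quick Benchmark sketch (hypothetical data and iteration count, not the OP's harness), comparing the multi-dim lookup against join over the array and join over the two elements:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# hypothetical point set -- the OP's real data is much larger
my @points = map { [ int rand 1000, int rand 1000 ] } 1 .. 1000;

my ( %cells, %mcells );
$cells{ join ':', @$_ } = 1 for @points;        # joined string keys
$mcells{ $_->[0], $_->[1] } = 1 for @points;    # multi-dim keys (joined with $;)

cmpthese( 500, {
    multi    => sub { exists $mcells{ $_->[0], $_->[1] }          or die for @points },
    join_arr => sub { exists $cells{ join ':', @$_ }              or die for @points },
    join_two => sub { exists $cells{ join ':', $_->[0], $_->[1] } or die for @points },
} );
```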

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

        There is also the issue of locales and the radix char when non-integer values are used with multidimensional hash keys.

        If one has coordinates

        [5,   5.5]
        [5.5, 5  ]

        then using the multidimensional hash approach in a locale where both $; and the radix char are a comma would result in ambiguous keys:

        "5,5,5"
        "5,5,5"

        I know the original post specifies signed ints, so I'm just being more generic here. I also haven't checked whether there is locale-specific behaviour of $;.
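        The ambiguity is easy to force without locales by putting the separator into the key data itself. A contrived sketch with $; set to a comma as a stand-in for the radix-char clash:

```perl
use strict;
use warnings;

my %h;
{
    local $; = ',';             # stand-in for a locale where the radix char is ","
    $h{ 5, '5,5' } = 'first';   # key string "5,5,5"
    $h{ '5,5', 5 } = 'second';  # also "5,5,5" -- the same slot!
}
print scalar keys %h, "\n";     # prints 1: two distinct "points", one key
```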

      Maybe stringification is faster
      use Data::Dump qw/pp dd/;
      local $" = $;;
      $p = [1,2];
      $h{"@$p"} = 12;
      $h{3,4}   = 34;
      dd %h;

      ("3\x1C4", 34, "1\x1C2", 12)

      no time to benchmark, sorry! (no pun intended ;-)

      update

      It's probably fastest to keep the points in that representation all the time.

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

        Update: Regarding performance, stringification "@$p" reaches the speed of join(':',@$p) in Perl 5.22 and onwards. See this post.

        Hi LanX,

        Added lan_hash and lan_look, based on str_hash and str_look respectively. The stringification is a nice catch. Surprisingly, it ran slower than join(':', @$p); I had expected it to be as fast or faster.

        sub lan_hash {
            # print "LanX_hash---------------\n";
            my %cells;

            # insert the points into the hash
            for my $p (@points) {
                my $h = $p->[0] . ' ' . $p->[1];
                my ( $x, $y ) = split ' ', $h;
                # print "x='$p->[0]' y='$p->[1]' h='$h' (x='$x' y='$y')\n";
                if ( $x != $p->[0] ) { die; }
                if ( $y != $p->[1] ) { die; }
                $cells{$h} = undef;    # ++$cells{$h};
            }
            scalar( keys %cells ) == $npoints or die;

            # look up each point in the hash
            for my $p (@points) {
                my $h = $p->[0] . ' ' . $p->[1];
                exists $cells{$h} or die;
            }
            exists $cells{'notfound'}  and die;
            exists $cells{'notfound2'} and die;
            exists $cells{'notfound3'} and die;
            return \%cells;
        }

        sub lan_look {
            my $cells = shift;
            for my $p (@points) {
                exists $cells->{"@$p"} or die;
            }
            exists $cells->{'notfound'}  and die;
            exists $cells->{'notfound2'} and die;
            exists $cells->{'notfound3'} and die;
        }

        my $big_ref = big_hash();
        my $lan_ref = lan_hash();
        my $mat_ref = mat_hash();
        my $pak_ref = pak_hash();
        my $st2_ref = str_hash();
        my $st3_ref = str_hash();
        my $str_ref = str_hash();

        timethese 200000, {
            Big => sub { big_look($big_ref) },    # $cells->{ ($p->[1] << 32) | ($p->[0] & 0xFFFFFFFF) }
            Lan => sub { lan_look($lan_ref) },    # $cells->{ "@$p" }
            Mat => sub { mat_look($mat_ref) },    # $cells->{ $_->[0] }{ $_->[1] }
            Pak => sub { pak_look($pak_ref) },    # $cells->{ pack "ii", $p->[0], $p->[1] }
            St2 => sub { st2_look($st2_ref) },    # $cells->{ join(':', @$p) }
            St3 => sub { st3_look($st3_ref) },    # $cells->{ $str }  # optimized
            Str => sub { str_look($str_ref) },    # $cells->{ $p->[0] .':'. $p->[1] }
        };

        Results from CentOS 7.3 VM running Perl 5.16.3.

        Regards, Mario

      You did not answer his question though. Can you batch the queries or do you need to run them one by one?

      If batching is allowed, then LanX has more or less revealed the solution. The problem is then very similar to sorting, and the fastest way to sort is probably the usual radix sort. (Maybe with AVX512 the balance has tipped in favor of merge sort; but I haven't tested that.)

      Once you have binned/sorted the queries, the lookup is essentially a linear pass over the (sorted) dataset. Don't loop over queries, map them.
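      A minimal sketch of that idea with made-up data: sort the queries once, then answer all of them in one merge-style pass over the sorted point set, advancing a single index instead of doing a fresh lookup per query.

```perl
use strict;
use warnings;

# sorted point set and sorted queries (hypothetical data)
my @set     = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] }
              ( [1, 2], [3, 4], [5, 6] );
my @queries = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] }
              ( [3, 4], [5, 5], [1, 2] );

my @hit;
my $i = 0;
for my $q (@queries) {
    # advance the set pointer while it is "less than" the query
    $i++ while $i < @set
        && ( $set[$i][0] <=> $q->[0] || $set[$i][1] <=> $q->[1] ) < 0;
    push @hit, ( $i < @set
        && $set[$i][0] == $q->[0]
        && $set[$i][1] == $q->[1] ) ? 1 : 0;
}
print "@hit\n";    # prints "1 1 0": (1,2) and (3,4) found, (5,5) not
```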

        Thanks for the tip. There is some scope for batching, I believe, and I hadn't thought of sorting. I'll write a new root node in the next day or so with details on the full problem.

        Update: the new root node: High Performance Game of Life
