in reply to Fastest way to lookup a point in a set
> I have a large set of points. ... I need to insert the set of points into a Perl data structure ... then perform many lookups to tell if a given point is in the set or not. I want the lookup to be as fast as possible.
How large is "large", how many is "many"?
I have the impression it's rather a space problem; my approach would be to split the hash into a HoH (hash of hashes) to allow swapping out the second tier.
This works well if you can organize your look-ups (by ordering or caching) so that accesses to the second tier of hashes are bundled, keeping swapping to a minimum.
see also Re: write hash to disk after memory limit and Re: Small Hash a Gateway to Large Hash?
For instance, looking up $point{$x}{$y} would be fastest if you can bundle all look-ups to points ($x,*) for one fixed $x.
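The bundling idea can be sketched like this (the point set and queries here are made up for illustration): sort the queries by $x so each second-tier hash is touched in one contiguous run.

```perl
use strict;
use warnings;

# Illustrative sketch: a HoH keyed by x then y; sorting queries by x
# means each second-tier hash %{ $point{$x} } is accessed in one run.
my %point;
$point{ $_->[0] }{ $_->[1] } = 1 for [1, 2], [1, 5], [3, 4];

my @queries = ( [3, 4], [1, 2], [1, 9] );
my @hits;
for my $q ( sort { $a->[0] <=> $b->[0] } @queries ) {
    push @hits, ( $point{ $q->[0] } && exists $point{ $q->[0] }{ $q->[1] } ) ? 1 : 0;
}
print "@hits\n";    # prints: 1 0 1
```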
tl;dr maybe I'm missing the point...(?)
On a side note: are you aware of multi-dim hashes in Perl?
C:\Windows\system32>perl -e "$x=1;$y=2;$h{$x,$y}=4; print $h{$x,$y}"
4
If yes, why do you need to join the keys yourself?
> String: $x . ':' . $y
see perldata#Multi-dimensional-array-emulation
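As perldata describes, the multi-dim subscript is just sugar for joining the keys with $; (SUBSEP, default "\x1C"); a minimal demonstration of the equivalence:

```perl
use strict;
use warnings;

my ( $x, $y ) = ( 1, 2 );
my %h;
$h{ $x, $y } = 4;                     # multi-dim subscript ...
print $h{ join $;, $x, $y }, "\n";    # ... is the same key joined with $; -- prints 4
```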
Re^2: Fastest way to lookup a point in a set
by eyepopslikeamosquito (Archbishop) on Aug 05, 2017 at 23:30 UTC
On a side note: are you aware of multi-dim hashes in Perl?
No, I wasn't aware of them.
Curiously, when I added them to my original test,
they turned out to be slower than join.
That is, Multi:
exists $cells->{$p->[0],$p->[1]} or die;
was slower than Str:
exists $cells->{ join ':', @{$p} } or die;
Benchmark results:
Str:   7 wallclock secs ( 6.58 usr +  0.00 sys =  6.58 CPU) @ 30404.38/s (n=200000)
Multi: 8 wallclock secs ( 7.98 usr +  0.00 sys =  7.98 CPU) @ 25050.10/s (n=200000)
Update: though Multi was faster than:
exists $cells->{ join ':', $p->[0], $p->[1] } or die;
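A self-contained sketch of such a comparison (the point data here is randomly generated, standing in for the original test set), also covering join over the two individual elements:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up data standing in for the original point set.
my @points = map { [ int rand 1000, int rand 1000 ] } 1 .. 1000;

my ( %str_cells, %multi_cells );
$str_cells{ join ':', @$_ }      = undef for @points;
$multi_cells{ $_->[0], $_->[1] } = undef for @points;

cmpthese( 500, {
    Str   => sub { exists $str_cells{ join ':', @$_ }              or die for @points },
    Str2  => sub { exists $str_cells{ join ':', $_->[0], $_->[1] } or die for @points },
    Multi => sub { exists $multi_cells{ $_->[0], $_->[1] }         or die for @points },
} );
```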
> they turned out to be slower than join.
Interesting. I think that's because in Multi you address the coordinates separately (two expressions to evaluate), while join can operate over the whole array (one list returned).
Maybe try how it compares with join over the two individual elements.
I wouldn't be surprised if the result is equally fast. Multi-dimensional hashes are an old (Perl 4) but somewhat obscure feature; they were probably never optimized beyond their equivalence to join.
Given two points
[5, 5.5]
[5.5, 5 ]
then using the multidimensional hash approach in a locale where both $; and the radix char are a comma would result in ambiguous keys:
{5,5,5}
{5,5,5}
I know the original post specifies signed ints, so I'm just being more generic here. I also haven't checked whether $; has any locale-specific behaviour.
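The collision is easy to demonstrate without actually switching locales, by setting $; to a comma and using pre-stringified coordinates (the strings mimic what a comma radix character would produce):

```perl
use strict;
use warnings;

$; = ',';                      # pretend the subscript separator is a comma
my %h;
$h{ '5', '5,5' } = 'point A';  # (5, 5.5) under a comma radix char
$h{ '5,5', '5' } = 'point B';  # (5.5, 5) under a comma radix char

# both subscripts join to the same key "5,5,5",
# so the second assignment silently overwrites the first
print scalar keys %h;          # prints 1
```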
Maybe stringification is faster
use Data::Dump qw/pp dd/;
local $" = $;;     # interpolate arrays with the subscript separator
my $p = [ 1, 2 ];
my %h;
$h{"@$p"} = 12;    # stringified array ref
$h{ 3, 4 } = 34;   # multi-dim subscript
dd %h;
("3\x1C4", 34, "1\x1C2", 12)
no time to benchmark, sorry!
(no pun intended ;-)
update
It's probably fastest to keep the points in that string representation all the time.
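That idea might look like this (data and names are illustrative): build each point's string key once, so every later look-up is a bare hash access with no per-query join:

```perl
use strict;
use warnings;

my @points = ( [1, 2], [3, 4], [-5, 7] );

# stringify once, up front, and keep the keys around
my @keys = map { join ':', @$_ } @points;

my %cells;
@cells{@keys} = ();                        # hash-slice insert

# look-ups reuse the prebuilt strings
exists $cells{$_} or die "missing $_" for @keys;
print scalar keys %cells;                  # prints 3
```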
Update: Regarding performance, stringification "@$p" matches the speed of join(':',@$p) in Perl 5.22 and onwards. See this post.
Hi LanX,
Added lan_hash and lan_look, based on str_hash and str_look respectively. The stringification is a nice catch. Surprisingly, it runs slower than join(':', @$p); I had expected it to be as fast or faster.
sub lan_hash {
    # print "LanX_hash---------------\n";
    my %cells;

    # insert the points into the hash
    for my $p (@points) {
        my $h = $p->[0] . ' ' . $p->[1];
        my ( $x, $y ) = split ' ', $h;
        # print "x='$p->[0]' y='$p->[1]' h='$h' (x='$x' y='$y')\n";
        if ( $x != $p->[0] ) { die; }
        if ( $y != $p->[1] ) { die; }
        $cells{$h} = undef;
        # ++$cells{$h};
    }
    scalar( keys %cells ) == $npoints or die;

    # lookup each point in the hash
    for my $p (@points) {
        my $h = $p->[0] . ' ' . $p->[1];
        exists $cells{$h} or die;
    }
    exists $cells{'notfound'}  and die;
    exists $cells{'notfound2'} and die;
    exists $cells{'notfound3'} and die;
    return \%cells;
}

sub lan_look {
    my $cells = shift;
    for my $p (@points) {
        exists $cells->{"@$p"} or die;
    }
    exists $cells->{'notfound'}  and die;
    exists $cells->{'notfound2'} and die;
    exists $cells->{'notfound3'} and die;
}
my $big_ref = big_hash();
my $lan_ref = lan_hash();
my $mat_ref = mat_hash();
my $pak_ref = pak_hash();
my $st2_ref = str_hash();
my $st3_ref = str_hash();
my $str_ref = str_hash();
timethese 200000, {
    Big => sub { big_look($big_ref) }, # $cells->{ ($p->[1] << 32) | ($p->[0] & 0xFFFFFFFF) }
    Lan => sub { lan_look($lan_ref) }, # $cells->{ "@$p" }
    Mat => sub { mat_look($mat_ref) }, # $cells->{ $_->[0] }{ $_->[1] }
    Pak => sub { pak_look($pak_ref) }, # $cells->{ pack "ii", $p->[0], $p->[1] }
    St2 => sub { st2_look($st2_ref) }, # $cells->{ join(':', @$p) }
    St3 => sub { st3_look($st3_ref) }, # $cells->{ $str } # optimized
    Str => sub { str_look($str_ref) }, # $cells->{ $p->[0] .':'. $p->[1] }
};
Results from CentOS 7.3 VM running Perl 5.16.3.
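For context on the Pak entry above, pack builds a fixed-width binary key from the two native ints, and the key round-trips cleanly through unpack; a small sketch:

```perl
use strict;
use warnings;

my $p = [ -3, 42 ];
my %cells;
$cells{ pack 'ii', @$p } = undef;          # two native signed ints -> one binary key

exists $cells{ pack 'ii', -3, 42 } or die;
my @back = unpack 'ii', ( keys %cells )[0];
print "@back";                             # prints: -3 42
```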
Regards, Mario
You did not answer his question, though. Can you batch the queries, or do you need to run them one by one?
If batching is allowed, then LanX has more or less revealed the solution. The problem is then very similar to sorting, and the fastest way to sort is probably the usual radix sort. (Maybe with AVX512 the balance has tipped in favor of merge sort; but I haven't tested that.)
Once you have binned/sorted the queries, the lookup is essentially a linear pass over the (sorted) dataset. Don't loop over queries, map them.
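In Perl terms, the sort-then-merge idea might look like this (the keys are illustrative strings; in practice a radix sort on packed integers would replace the lexicographic sort):

```perl
use strict;
use warnings;

my @set     = sort qw(1:2 3:4 5:6 7:8);   # sorted dataset keys
my @queries = sort qw(3:4 9:9 7:8);       # sorted (batched) query keys

# one linear pass over the dataset answers all queries
my %found;
my $i = 0;
for my $q (@queries) {
    $i++ while $i < @set and $set[$i] lt $q;
    $found{$q} = ( $i < @set and $set[$i] eq $q ) ? 1 : 0;
}
print join ' ', map { "$_=$found{$_}" } @queries;   # prints: 3:4=1 7:8=1 9:9=0
```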
Thanks for the tip.
There is some scope for batching, I believe, and I hadn't thought of sorting.
I'll write a new root node in the next day or so with details on the full problem.
Update: the new root node: High Performance Game of Life