This has to be the best example of bad testing I've ever seen.
Now for the benchmark.
Incorporating the random generation of the dataset into each of the benchmarks is like incorporating the building of the track into an F1 race. The generation of the test data completely obscures the thing you are trying to test.
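To make that concrete, here is a minimal sketch of a benchmark that builds the dataset once, outside the timed code, so that only the counting is measured. It is not the code from the original post: the sizes ($N, $RANGE) and the sub names count_with_hash() and count_with_array() are made up for illustration, and it assumes the values are small non-negative integers so that an array index can stand in for a hash key.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sizes, chosen so the sketch runs quickly.
my $N     = 100_000;
my $RANGE = 1_000;

# Build the dataset ONCE, before any timing starts.
my @data = map { int rand $RANGE } 1 .. $N;

# Count distinct values using a hash keyed on the value.
sub count_with_hash {
    my ($values) = @_;
    my %seen;
    $seen{$_}++ for @$values;
    return scalar keys %seen;
}

# Count distinct values using the value itself as an array index.
# Only valid for small, non-negative integers.
sub count_with_array {
    my ($values) = @_;
    my @seen;
    $seen[$_]++ for @$values;
    return scalar grep { defined } @seen;
}

# Both methods must agree before the timings mean anything.
die "hash and array counts disagree"
    unless count_with_hash( \@data ) == count_with_array( \@data );

# Time only the counting; each sub runs for at least 1 CPU second.
cmpthese( -1, {
    hash  => sub { count_with_hash( \@data ) },
    array => sub { count_with_array( \@data ) },
} );
```

The actual ratio you see will depend on your perl build and the range of values, so treat any numbers it prints as specific to your machine.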
This statement: "In the interest of factoring out commonality I used the same random number generating code for all three subs." is completely wrong.
That you are calling subroutines with the same apparent name, rand_ranged_value(), completely misses the fact that the Perl code is actually calling a subroutine called XS_main_rand_ranged_value(), which does a bunch of stuff both before and after it calls the subroutine that the C code calls directly.
So what you are actually benchmarking is the C compiler's ability to compile-time optimise (e.g. inline) 100e6 calls to another C subroutine versus Perl's inability to optimise (at runtime) 100e6 calls to an XS subroutine, which unwraps some native values from their Perl scalar wrappers before calling (unoptimised) the C subroutine, and then wraps the resultant native integer back into a Perl scalar. The result is skewed by far more than a factor of two in favour of the C code.
Which might be alright if that was what you set out to benchmark, but it isn't. The actual process of counting the unique integer values is simply lost in the noise.
If you need to derive counts of unique values in a huge dataset of ints, then that huge dataset must already exist, so it won't need to be "generated". Which makes adding it into the benchmark ...
Equally significant is the fact that storing 100 million integers in memory (in Perl) would take at least 3.2GB, or about 32 bytes per element. Which precludes using a 32-bit perl for the job. And of course, those values will need to come from somewhere. Probably a file. And once you add IO into the equation, the roughly 60% speed advantage of using an array over a hash for this purpose will be completely lost in the noise of the IO anyway.
And if the data is in a file, then you don't need to load it all into memory to perform the counting.
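As a rough illustration of counting straight from a file without building a big array first, something along these lines would do. This is a sketch, not the code from the original post: it assumes one integer per line, the sub name count_unique_from_fh() is made up, and an in-memory filehandle stands in for the real data file.

```perl
use strict;
use warnings;

# Count distinct values read from a filehandle, one value per line.
# Only the hash of distinct values is held in memory, never the whole dataset.
sub count_unique_from_fh {
    my ($fh) = @_;
    my %seen;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless length $line;    # skip blank lines
        $seen{$line}++;
    }
    return scalar keys %seen;
}

# Demo: an in-memory "file" stands in for the real data file.
my $sample = join( "\n", 3, 1, 3, 7, 1, 3 ) . "\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print count_unique_from_fh($fh), "\n";    # prints 3
close $fh;
```

Memory use here scales with the number of distinct values rather than the number of lines, which is the whole point when the file holds 100 million entries drawn from a small range.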
So if you had to go through the process of building that 3.2GB array in order to use the C version, the time saved in C would be swamped by the time spent allocating the memory and building the array.
And if you decided to call an XS subroutine for every line of the file to avoid that, then the XS would fare very badly relative to either of the pure Perl methods.
So, to answer your title question, "'Just use a hash': An overworked mantra?": No, it isn't. For this specific dataset (a contiguous range of low-value integers), and if the dataset is already in memory, using an array may have some performance advantage. But for the general case, "Use a hash" for this purpose is very good advice.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.