Re: "Just use a hash": An overworked mantra?

This has to be the best example of bad testing I've ever seen.

*_count(): Correct key count in RV.
This is a pointless and logically incorrect test.
What if one of the numbers in the range 1 .. 999 was never generated by your random number generator?
*_count(): Correct value count in RV.
You've just tested how many keys are in the hash or array. Do you expect the number of values to be different? Can they be different?
*_count(): max integer in range.
Again, a pointless and incorrect test.
It is perfectly legitimate -- even if unlikely -- for a sequence of random numbers to not contain the largest value possible.
*_count(): min integer in range.
Ditto.

Now for the benchmark.

Incorporating the random generation of the dataset into each of the benchmarks is like incorporating the building of the track into an F1 race. The generation of the test data completely obscures the thing you are trying to test.

This statement: "In the interest of factoring out commonality I used the same random number generating code for all three subs." is completely wrong.

That you are calling subroutines with the same apparent name: rand_ranged_value() completely misses the fact that the Perl code is actually calling a subroutine called XS_main_rand_ranged_value() which does a bunch of stuff both before and after it calls the subroutine that the C code calls directly:

XS(XS_main_rand_ranged_value)
{
#ifdef dVAR
    dVAR; dXSARGS;
#else
    dXSARGS;
#endif
    if (items != 2)
       croak_xs_usage(cv,  "min, max");
    {
    int    min = (int)SvIV(ST(0));
    int    max = (int)SvIV(ST(1));
    int    RETVAL;
    dXSTARG;

    RETVAL = rand_ranged_value(min, max);
    XSprePUSH; PUSHi((IV)RETVAL);
    }
    XSRETURN(1);
}
[download]

So what you are actually benchmarking is the C compilers ability to compile-time optimise (eg.inline) a 100e6 calls to another C subroutine versus Perl's inability to optimise (at runtime) 100e6 calls to an XS subroutine which unwraps some native values from their Perl scalar wrappers before calling (unoptimised) the C subroutine 100e6 times, before wrapping the resultant native integer back into a Perl scalar. The result is far more than 2 x times skewed in favour of the C code.

Which might be alright if that was what you set out to benchmark, but it isn't. The actual process of counting the unique integer values is simply lost in the noise.

If you have need to derive counts of unique values in a huge dataset of ints, then that huge dataset must already exist, so it won't need to be "generated". Which makes adding it into the benchmark ...

Equally significant is the fact that storing 100 million integers in memory (in Perl) would take at least 3.2GB. Which precludes using a 32-bit perl for the job. And of course, those values will need to come from somewhere. Probably a file. And once you add IO into the equation, the roughly 60% speed advantage of using an array over a hash for this purpose:

#! perl -slw
use strict;
use Math::Random::MT qw[ rand srand ];
use Benchmark qw[ cmpthese ];

sub aCount {
    my $aRef = shift;
    my @counts;
    ++$counts[ $aRef->[ $_ ] ] for 0 .. $#{ $aRef };
    return \@counts;
}

sub hCount {
    my $aRef = shift;
    my %counts;
    ++$counts{ $aRef->[ $_ ] } for 0 .. $#{ $aRef };
    return \%counts;
}


our $N //= 1e6;

our @rands; $#rands = $N;
$rands[ $_ ] = 1+ int( rand( 999 ) ) for 0 .. $N;

cmpthese -1, {
    array => q[ my $res = aCount( \@rands ); ],
    hash  => q[ my $res = hCount( \@rands ); ],
};

__END__
C:\test>junk1
        Rate  hash array
hash  3.94/s    --  -36%
array 6.15/s   56%    --

C:\test>junk1
        Rate  hash array
hash  3.88/s    --  -38%
array 6.23/s   61%    --
[download]

Will be completely lost in the noise of the IO anyway. And if the data is in a file, then you don't need to load it all into memory to perform the counting:

perl 
-nlE"++$h{ $_ } }{ say qq[$_ : $h{ $_ }] for sort{ $a <=> $b } keys %h
+" 
< theBigFileOfInts
[download]

So if you had to go through the process of building that 3.2GB array in order to use the C version, the time saved in C would be swamped by the time spent allocating the memory and building the array.

And if you decided to call an XS subroutine for every line of the file to avoid that, then the XS would fare very badly relative to either of the pure Perl methods.

So, to answer your title question "Just use a hash": An overworked mantra?: No, It isn't. Whilst for this specific dataset: a set consisting of a contiguous range of low value integers; and if the dataset is already in memory, using an array may have some performance advantage, for the general case "Use a hash" for this purpose is very good advice.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

Comment on Re: "Just use a hash": An overworked mantra? Select or Download Code

Replies are listed 'Best First'.
Re^2: "Just use a hash": An overworked mantra? by davido (Cardinal) on Nov 16, 2011 at 22:53 UTC
I do appreciate you taking the time to look it over. I've had my own concerns as well, and value your input. Let's talk about the random number generator for a moment because I have concerns too. Let me start by saying that I agree 100% that it's an unfortunate compromise, and I would be interested in hearing alternatives. As you pointed out it's unlikely that anyone is holding onto 3.2GB of integers in memory, but I needed a source of data. The original problem stated the integers were held in files. It would probably be worthwhile for me to alter the benchmark to pull in integers from a file instead. But then I'm testing IO, not the algorithm. Nevertheless, the random number generator is a poor alternative. Before posting, I was already dissatisfied enough with it that I profiled the code to see what affect it was having on the overall algorithms. The calls to the random number generator were consuming about 1/3rd of the time spent in the Perl array-method comparison sub. In the hash method, it was a nearly equal number of seconds spent in the random number generator, so the ratio was lower. The ratio was about about 1/3rd of the time for the XS comparison too, and that alone pretty clearly demonstrated that when Perl called the random number generator it was more expensive than when C did. I considered posting profiling data as well, but really a better approach is needed altogether. However, what we do have is consistency in the Perl functions. The array method is calling the same random generator, with the same overhead as the hash method. I'm going to create an example here using contrived numbers from my own recollection since I am not looking directly at my original post or the profiling data right now. Let's say that we spend 1/3rd of 12.5 seconds in generating random numbers (that's pretty accurate). That's a little over 4 seconds. In the XS version of the test we spent about 2/3rds of a second. Let's factor that out. That gives us 20 seconds per iteration for the hash method, 8.5 seconds for the array method, and 1.4 seconds for each iteration of the XS method. Add back in whatever constant time you want for disk IO: 60 seconds per test iteration? The hash approach takes 80 seconds now, the array approach takes 72.5, and the XS approach takes 61.4. None is fast, but considering the relative ease with which the quicker approaches can be applied isn't the improvement significant enough to investigate? The data set was large enough that it would be a freak of improbability for a well-performing random generator to not at least return some values at the top most and bottom most positions in the range. I was using the tests to ensure that I had filled my allocated array from 1 to 999 without undershooting or overshooting it. If the tests showed I didn't hit the top and bottom, there were enough data pieces that it would be worth checking whether the generator was flawed. I wanted to also verify that when I hit the top of the array I didn't overshoot it. If I had, rather than the test failing I might have just crashed, but even if that happened, the tests served their purpose in exercising the code at its edges, and would have indicated to me where I was when such an event took place. Bad testing is none at all. This falls way short of good testing, but it was what I needed to check for some off-by-one possibilities. The key/value test was redundant though. And the tests were far from comprehensive. There was a narrow set of issues I wanted to double check. Dave	[reply]
Re^3: "Just use a hash": An overworked mantra? by BrowserUk (Patriarch) on Nov 17, 2011 at 01:25 UTC
I'm going to create an example here using contrived numbers from my own recollection since I am not looking directly at my original post or the profiling data right now.... Sorry, but those projections, even as talking points are so far out as to be almost meaningless. A few real statistics: ## Gen a file of 100e6 random ints:1 .. 999 perl -E"say 1+ int( rand 999 ) for 1 .. 100e6" > rands.dat ## Now count them using IO and a hash and write the counts to a file: [22:59:34.88] c:\test>perl -nlE"++$h{ $_ } }{ say qq[$_ :: $h{ $_ }] f +or sort{$a<=>$b} keys %h" rands.dat > hash.res [23:00:16.83] c:\test> rem 41.95 seconds ## Same thing using an array: [23:06:58.13] c:\test>perl -nlE"++$h[ $_ ] }{ say qq[$_ :: $h[ $_ ]] f +or 0 .. $#h" rands.dat > array.res [23:07:38.02] c:\test> rem 39.49 seconds [download] The timings of multiple runs is consistent to within 3 or 4 1/100ths for both methods and the 2 second difference on 100e6 data items is negligible. And even some of that difference is accounted for by the need to sort the keys prior to output in order that the result files are the same. the XS approach takes Sorry again, but you are still missing that to use XS code for this, you either: need to load the whole dataset into memory, which on my 4GB/64-bit machine takes over 10 minutes: `[ 0:37:15.07] c:\test>perl -nE"$a[$.-1]=0+$_" rands.dat [ 0:47:23.27] c:\test>` [download] (It moves into swapping). And that's before you even start counting the integers. Or you need to call into the XS code for every line: <Reveal this spoiler or all in this thread> Which means you are making 100e6 calls from perl to XS, and then 100e6 calls from there into C and results in the XS code taking half as long again as the pure Perl solutions: `[ 0:32:32.61] c:\test>junk1-ic rands.dat 2>nul [ 0:33:28.69] c:\test>rem 56.08 seconds` [download] Or, you move the the file handling into the XS also. At which point you may as well skip the Perl and write the solution directly in C. And it is still doubtful if you would gain any significant savings on the pure Perl solutions, and would definitely have lost out when you consider the time taken to write and compile the C program, over using a Perl one-liner. The data set was large enough that it would be a freak of improbability for a well-performing random generator to not at least return some values at the top most and bottom most positions in the range. I was using the tests to ensure that I had filled my allocated array from 1 to 999 without undershooting or overshooting it. In random data, it is unlikely, but not impossible. As a place-holder for the data in the problem you started with, completely possible. Depending on how that data was derived, maybe even likely. In effect, you were conflating a test of your random number generator with a test of the implementation of the algorithm. If your purpose is to test that generator, far better to test it in isolation, running over a range of say 1..9 for a few thousand iterations and check not only that all the possible values are generated, but they occur with approximately equal frequency. IMO, adding these tests to the code for your meditation smacks of wanting to be seen to "do the right thing" rather than actually serving any real purpose for meditation itself. Don't take my critisisms as all negative. I realise -- from your recent posts -- that you are just getting into IC/XS and this probably started out as an experiment in seeing what you could do with IC. And for that I commend both your experiment and the posting of your findings in your meditation. I hope it will encourage others to take the step. This place would be all the better for more IC/XS coverage than it currently gets. Basically all I'm saying is that whilst I read your meditation with considerable interest, I find the case for your premise unproven by your methodology. That doesn't mean I don't think it is valuable. One of the first lessons you have to learn with IC is when it is worth going there and when not. In that respect, this is a great example of the care that must be taken when making that assessment. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re^3: "Just use a hash": An overworked mantra? by Tanktalus (Canon) on Nov 17, 2011 at 00:29 UTC
The original problem stated the integers were held in files. It would probably be worthwhile for me to alter the benchmark to pull in integers from a file instead. But then I'm testing IO, not the algorithm. Not really. You'd at least be testing the real-world scenario. And what you'd find is most likely that the IO so completely swamps everything else once you get into using a hash that further savings will be immaterial. If saving the 20 seconds that the hash costs is important, you could use IO::AIO to read in the data and insert it into the hash - if I read that module right, you should be able to be inserting and reading simultaneously, which now brings you solely to reading speed (whether local disk or over the network). I doubt it'll be important. Basically, this is the definition of premature optimisation. You're micro-optimising something that will have no effect on the overall speed because you're ignoring the elephant in the algorithm: disk IO. That and the payoff isn't that great even when you do make this optimisation. Dropping from the hash to the array is a huge savings for the cost involved. Dropping to XS is not as good because the development and debug time likely will completely dwarf the savings again :-) For the record, I expect using IO::AIO to be cheaper to implement than your XS version, give a negligibly better savings, and still not be worth it overall :-)	[reply]


Welcome to the Monastery
	PerlMonks

Re: "Just use a hash": An overworked mantra?

Don't take my critisisms as all negative.