PerlMonks
Re^3: "Just use a hash": An overworked mantra?

by BrowserUk (Patriarch)
on Nov 17, 2011 at 01:25 UTC [id://938503]


in reply to Re^2: "Just use a hash": An overworked mantra?
in thread "Just use a hash": An overworked mantra?

"I'm going to create an example here using contrived numbers from my own recollection since I am not looking directly at my original post or the profiling data right now...."

Sorry, but those projections, even as talking points, are so far out as to be almost meaningless.

A few real statistics:

## Gen a file of 100e6 random ints: 1 .. 999
perl -E"say 1+ int( rand 999 ) for 1 .. 100e6" > rands.dat

## Now count them using IO and a hash and write the counts to a file:
[22:59:34.88] c:\test>perl -nlE"++$h{ $_ } }{ say qq[$_ :: $h{ $_ }] for sort{$a<=>$b} keys %h" rands.dat > hash.res
[23:00:16.83] c:\test> rem 41.95 seconds

## Same thing using an array:
[23:06:58.13] c:\test>perl -nlE"++$h[ $_ ] }{ say qq[$_ :: $h[ $_ ]] for 0 .. $#h" rands.dat > array.res
[23:07:38.02] c:\test> rem 39.49 seconds

The timings of multiple runs are consistent to within three or four hundredths of a second for both methods, and the 2-second difference on 100e6 data items is negligible. Even some of that difference is accounted for by the need to sort the hash keys prior to output so that the two result files are identical.
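For anyone who wants to replay the comparison without generating the 100e6-line file, the two counting approaches unpack into an ordinary script along these lines (a small in-memory dataset stands in for rands.dat; the variable names are mine, not from the one-liners above):

```perl
use strict;
use warnings;

# Small in-memory stand-in for rands.dat: random ints 1 .. 999.
my @data = map { 1 + int( rand 999 ) } 1 .. 100_000;

# Hash-based counting: the value is the key.
my %by_hash;
++$by_hash{ $_ } for @data;

# Array-based counting: the value IS the index, so no key hashing.
my @by_index;
++$by_index[ $_ ] for @data;

# The two structures hold identical counts; only the lookup mechanism differs.
for my $v ( sort { $a <=> $b } keys %by_hash ) {
    die "mismatch at $v" unless $by_hash{ $v } == ( $by_index[ $v ] // 0 );
}
print "counts agree for ", scalar( keys %by_hash ), " distinct values\n";
```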

"the XS approach takes"

Sorry again, but you are still missing that to use XS code for this, you either:

  1. need to load the whole dataset into memory, which on my 4GB/64-bit machine takes over 10 minutes:
    [ 0:37:15.07] c:\test>perl -nE"$a[$.-1]=0+$_" rands.dat
    [ 0:47:23.27] c:\test>

    (It moves into swapping). And that's before you even start counting the integers.

  2. Or you need to call into the XS code for every line:
    #! perl -slw
    use strict;

    use Inline C => Config => BUILD_NOISY => 1;
    use Inline C => <<'END_C', NAME => 'junk1_IC', CLEAN_AFTER_BUILD => 0;

    IV initSieve( int size ) {
        int *sieve;
        Newxz( sieve, size, int );
        if( !sieve )
            croak( "ints_per_bucket: Couldn't allocate memory.\n" );
        return (IV)sieve;
    }

    void sieve( IV psieve, int val ) {
        int *sieve = (int*)psieve;
        ++sieve[ val ];
        return;
    }

    void sieveStats( IV psieve, int size ) {
        Inline_Stack_Vars;
        int *sieve = (int*)psieve;
        int i;
        Inline_Stack_Reset;
        for( i = 0; i < size; ++i ) {
            Inline_Stack_Push( sv_2mortal( newSViv( sieve[ i ] ) ) );
        }
        Inline_Stack_Done;
    }

    void freeSieve( IV psieve ) {
        int *sieve = (int*)psieve;
        if( sieve ) Safefree( sieve );
        return;
    }

    END_C

    use Data::Dump qw[ pp ];

    my $sieve = initSieve( 999 );
    sieve( $sieve, 0+$_ ) while <>;

    my @res = sieveStats( $sieve, 999 );
    pp \@res;

    freeSieve( $sieve );

    Which means you are making 100e6 calls from Perl into XS, and then 100e6 more calls from there into C. The result is that the XS code takes half as long again as the pure Perl solutions:

    [ 0:32:32.61] c:\test>junk1-ic rands.dat 2>nul
    [ 0:33:28.69] c:\test>rem 56.08 seconds
  3. Or, you move the file handling into the XS as well.

    At which point you may as well skip Perl and write the solution directly in C. Even then it is doubtful whether you would gain any significant saving over the pure Perl solutions, and you would certainly lose out once you account for the time taken to write and compile the C program versus using a Perl one-liner.
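The per-call overhead behind option 2 can be demonstrated without any C at all: routing each increment through a subroutine call -- a stand-in for one sieve() XS call per input line (the countOne name below is made up for this sketch) -- adds measurable cost over the inline loop:

```perl
use strict;
use warnings;
use Time::HiRes qw( time );

my @data = map { 1 + int( rand 999 ) } 1 .. 1_000_000;

# Inline increment: the loop body runs with no call overhead.
my @direct;
my $t0 = time;
++$direct[ $_ ] for @data;
my $inline_secs = time() - $t0;

# One sub call per item, mimicking one XS call per input line.
my @called;
sub countOne { ++$called[ $_[0] ] }    # hypothetical stand-in for sieve()
$t0 = time;
countOne( $_ ) for @data;
my $percall_secs = time() - $t0;

printf "inline: %.3fs  one-call-per-item: %.3fs\n",
    $inline_secs, $percall_secs;
```

On any build of Perl the second timing should come out noticeably slower, which is the same effect -- call dispatch dominating a trivial loop body -- that makes the 100e6 XS calls above a losing proposition.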

"The data set was large enough that it would be a freak of improbability for a well-performing random generator to not at least return some values at the top most and bottom most positions in the range. I was using the tests to ensure that I had filled my allocated array from 1 to 999 without undershooting or overshooting it."

In random data, it is unlikely, but not impossible. As a place-holder for the data in the problem you started with, completely possible. Depending on how that data was derived, maybe even likely.

In effect, you were conflating a test of your random number generator with a test of the implementation of the algorithm.

If your purpose is to test that generator, it is far better to test it in isolation: run it over a range of, say, 1..9 for a few thousand iterations and check not only that all the possible values are generated, but that they occur with approximately equal frequency.
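A sketch of such an isolated test, under the assumptions in the paragraph above (range 1..9, a few thousand iterations, and an arbitrary 10% tolerance on the expected per-value frequency -- generous, given that the statistical noise at this sample size is around 1%):

```perl
use strict;
use warnings;

# Exercise the generator over a small range for many iterations.
my $range  = 9;
my $trials = 90_000;               # 10_000 expected per value

my %seen;
++$seen{ 1 + int( rand $range ) } for 1 .. $trials;

# Every possible value must have been generated at least once...
die "missing value $_" for grep { !$seen{ $_ } } 1 .. $range;

# ...and each must occur with approximately equal frequency.
my $expect = $trials / $range;
for my $v ( 1 .. $range ) {
    my $dev = abs( $seen{ $v } - $expect ) / $expect;
    die sprintf( "value %d off by %.0f%%", $v, 100 * $dev )
        if $dev > 0.10;
}
print "all $range values present, frequencies within 10% of $expect\n";
```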

IMO, adding these tests to the code for your meditation smacks of wanting to be seen to "do the right thing" rather than actually serving any real purpose for the meditation itself.

Don't take my criticisms as all negative.

I realise -- from your recent posts -- that you are just getting into IC/XS and this probably started out as an experiment in seeing what you could do with IC. And for that I commend both your experiment and the posting of your findings in your meditation. I hope it will encourage others to take the step. This place would be all the better for more IC/XS coverage than it currently gets.

Basically all I'm saying is that whilst I read your meditation with considerable interest, I find the case for your premise unproven by your methodology.

That doesn't mean I don't think it is valuable. One of the first lessons you have to learn with IC is when it is worth going there and when not. In that respect, this is a great example of the care that must be taken when making that assessment.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
