Hm. That's the one I use for profiling C code, and I've found it very effective. Effective to the point of detecting a difference between two identical opcodes where one causes a cache miss and the other doesn't.
OK, you've inspired me to look at sleepy again. Do you have any tips on using it? Since it's a sampling profiler, I assume the test cases need to run for some time? Are there any specific compile options I should use?
I isolated some of the code for the memory test case (the 80%+ slowdown), and it turns out that the 64-bit 5.24 version is much faster than the 32-bit 5.8.9 version on basic Perl/XS/C object creation/destruction. I need to do more digging.
I've been writing other test cases, and I suspect something in the XS layer.