PerlMonks |
Re^5: The 10**21 Problem (Part 3) by oiskuu (Hermit)
on May 17, 2014 at 22:05 UTC ( [id://1086469]=note )
Well, this is curious. Intel reference has this about prefetchtx:
I ran a little pointer-chasing bench (on Nehalem). The optimum appears to be fetching ~16 links ahead, but this is just an empirical point. You could try increasing the prefetch distance. There's a LOAD_HIT_PRE event that indicates too-late prefetches; you might try that. Also, it helps to see clocks together with UOPS_RETIRED (or INST_RETIRED), to see whether the loop does a lot of work or a lot of stalling. Branch mispredictions may also show up there.

Update. One article gives these figures for Haswell: 10 line fill buffers, 16 outstanding L2 misses. Prefetch hints that can't be tracked are simply dropped. There are also hardware prefetcher units that detect "streams" of consecutive requests in the same direction. So yes, the order of memory accesses (prefetches too?) can make a difference. Intel has some docs on tuning.

Your loop could be improved in many ways, but don't get carried away. Figure out how you can look up q8+q9 together, eliminating two inner loops.
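
For anyone who wants to experiment with the prefetch distance, here is a minimal sketch of the idea. It is not the bench I ran and not the loop from this thread: the function names, the index-array layout and the `dist` parameter are all made up for illustration, and `__builtin_prefetch` assumes GCC or Clang. It issues a prefetch for the element needed `dist` iterations later while working on the current one.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline: sum table[idx[i]] with no software prefetch. */
uint64_t sum_plain(const uint64_t *table, const uint32_t *idx, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += table[idx[i]];
    return sum;
}

/*
 * Same sum, but hint the cache about the entry that will be needed
 * `dist` iterations from now.  `dist` is the (hypothetical) tuning knob:
 * too small and the data arrives late, too large and the hints may be
 * dropped or the lines evicted before use.
 */
uint64_t sum_with_prefetch(const uint64_t *table, const uint32_t *idx,
                           size_t n, size_t dist)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n) {
            /* 0 = prefetch for read, 3 = keep in all cache levels */
            __builtin_prefetch(&table[idx[i + dist]], 0, 3);
        }
        sum += table[idx[i]];
    }
    return sum;
}
```

To see any effect, the table has to be much larger than the last-level cache and `idx` should hold random positions. Time both functions for a range of `dist` values; the best value will depend on the machine.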