
Re^5: The 10**21 Problem (Part 3)

by oiskuu (Hermit)
on May 17, 2014 at 22:05 UTC ( #1086469 )

in reply to Re^4: The 10**21 Problem (Part 3)
in thread The 10**21 Problem (Part 3)

Well, this is curious. The Intel reference says this about the prefetcht0/t1/t2 instructions:

Fetches the line of data from memory that contains the byte specified with the source operand to a location in the cache hierarchy specified by locality hint.
There's no need to align the prefetch pointer. Be sure to align the data records themselves, of course.

I ran a little pointer-chasing benchmark (on Nehalem). The optimum appeared to be fetching ~16 links ahead, but that is just one empirical data point. You could try increasing the prefetch distance.

There's a LOAD_HIT_PRE event that indicates too-late prefetches; you might try that. It also helps to look at clocks alongside UOPS_RETIRED (or INST_RETIRED), to see whether the loop is doing a lot of work or a lot of stalling. Branch mispredictions may show up there as well.


One article gives these figures for Haswell: 10 line-fill buffers and 16 outstanding L2 misses. Prefetch hints that can't be tracked are simply dropped. There are also hardware prefetcher units that detect "streams" of consecutive requests in the same direction. So yes, the order of memory accesses (prefetches too?) can make a difference.

Intel has some docs on tuning. Your loop could be improved in many ways, but don't get carried away. Figure out how you can look up q8+q9 together, eliminating two inner loops.

Replies are listed 'Best First'.
Re^6: The 10**21 Problem (Part 3)
by eyepopslikeamosquito (Chancellor) on May 17, 2014 at 23:13 UTC

    There's no need to align the prefetch pointer.
    Whoops, yes, you are right. I made a silly mistake in my original test; with that blunder fixed, your version runs at the same speed as mine (both run in 38 seconds).

    Curiously, my version runs in 37 seconds with:

    _mm_prefetch(&bytevecM[(unsigned int)m7 & 0xffffff80], _MM_HINT_T0);
    _mm_prefetch(&bytevecM[64 + ((unsigned int)m7 & 0xffffff80)], _MM_HINT_T0);
    versus 40 seconds with:
    _mm_prefetch(&bytevecM[(unsigned int)m7], _MM_HINT_T0);
    _mm_prefetch(&bytevecM[(unsigned int)m7 ^ 64], _MM_HINT_T0);
    I have no explanation for that, unless perhaps the prefetcher prefers fetches issued in ascending address order (?).

    Thanks for the other tips. Optimizing for the prefetcher seems to be something of a dark art -- if you know of any cool links on that, please let me know.
