Do you know where your variables are? PerlMonks

### Re^2: The 10**21 Problem (Part 3)

by eyepopslikeamosquito (Chancellor)
 on May 14, 2014 at 20:07 UTC ( #1086076=note: print w/replies, xml ) Need Help??

in reply to Re: The 10**21 Problem (Part 3)
in thread The 10**21 Problem (Part 3)

Thanks for the great ideas to try. I suspect you work in this domain in your day job, is that right? In any case, you seem to have a lot more experience in HPC than me.

I was able to quickly test one of your excellent tips:

Better still: prefetch the data needed for next loop iteration, that is vecM[m7next]. It will then be ready at the right time.
Simply changing the inner loop from:
```for (q7 = 1; q7 < 128; ++q7) {
if (q7 == 10 || q7 == 13) continue;
m7 = (m6 ^ q7) * H_PRIME;
s7 = _mm_mullo_epi32(_mm_xor_si128(s6, _mm_set1_epi32(q7)), hp);
_mm_prefetch(&bytevecM[(unsigned int)m7 & 0xffffff80], _MM_HINT_T0
+);
_mm_prefetch(&bytevecM[64+((unsigned int)m7 & 0xffffff80)], _MM_HI
+NT_T0);
UNROLL(1)
UNROLL(2)
UNROLL(3)
UNROLL(4)
UNROLL(5)
UNROLL(6)
UNROLL(7)
UNROLL(8)
UNROLL(9)
UNROLL(11)
UNROLL(12)
for (q8 = 14; q8 < 128; ++q8) {
UNROLL(q8)
}
}
to two inner loops:
```for (q7 = 1; q7 < 128; ++q7) {
if (q7 == 10 || q7 == 13) continue;
m7 = (m6 ^ q7) * H_PRIME;
m7arr[q7] = m7;
_mm_prefetch(&bytevecM[(unsigned int)m7 & 0xffffff80], _MM_HINT_T0
+);
_mm_prefetch(&bytevecM[64+((unsigned int)m7 & 0xffffff80)], _MM_HI
+NT_T0);
}

for (q7 = 1; q7 < 128; ++q7) {
if (q7 == 10 || q7 == 13) continue;
m7 = m7arr[q7];
s7 = _mm_mullo_epi32(_mm_xor_si128(s6, _mm_set1_epi32(q7)), hp);
UNROLL(1)
UNROLL(2)
UNROLL(3)
UNROLL(4)
UNROLL(5)
UNROLL(6)
UNROLL(7)
UNROLL(8)
UNROLL(9)
UNROLL(11)
UNROLL(12)
for (q8 = 14; q8 < 128; ++q8) {
UNROLL(q8)
}
}
reduced the running time by 11 seconds, from 48 seconds to 37 seconds (371 to 286 years). The cost of the extra loop and storing 128 m7 values in an array is well worth it to reduce memory latency.

Update: This variation was 43 seconds, six seconds slower:

```for (q7 = 1; q7 < 128; ++q7) {
if (q7 == 10 || q7 == 13) continue;
m7 = (m6 ^ q7) * H_PRIME;
m7arr[q7] = m7;
_mm_prefetch(&bytevecM[(unsigned int)m7 & 0xffffff80], _MM_HINT_T0
+);
_mm_prefetch(&bytevecM[64+((unsigned int)m7 & 0xffffff80)], _MM_HI
+NT_T0);
s7 = _mm_mullo_epi32(_mm_xor_si128(s6, _mm_set1_epi32(q7)), hp);
s7arr[q7] = s7;
}

for (q7 = 1; q7 < 128; ++q7) {
if (q7 == 10 || q7 == 13) continue;
m7 = m7arr[q7];
s7 = s7arr[q7];
// ... same UNROLL as before
}

Replies are listed 'Best First'.
Re^3: The 10**21 Problem (Part 3)
by oiskuu (Hermit) on May 16, 2014 at 06:38 UTC

Prefetching is tricky. I wouldn't try so many of them together. The number of outstanding requests is limited. How about this:

```// candidates for vectorization, lets break them apart
static inline void q7_to_m7(int m7[], int m6)
{
int q, m6x = m6 ^ 1;
for (q = 0; q < 128; q += 2) {      // unroll by 2
m7[q] = (m6 ^ q) * H_PRIME;
m7[q+1] = (m6x ^ q) * H_PRIME;
}
}

static inline void prefetch_m(unsigned int i)
{
_mm_prefetch(&bytevecM[i], _MM_HINT_T0);
_mm_prefetch(&bytevecM[i^64], _MM_HINT_T0);
}

...
prefetch_m((m6^1) * H_PRIME);
int m7arr[130];
q7_to_m7(m7arr, m6);
// fixup for prefetching two iterations ahead
m7arr[129] = m7arr[128] = m7arr[127];
m7arr[13] = m7arr[15]; m7arr[10] = m7arr[12];

prefetch_m(m7arr[2]);
for (q7 = 1; q7 < 128; ++q7) {
if (q7 == 10 || q7 == 13) continue;
prefetch_m(m7arr[q7+2]);
m7 = m7arr[q7];
...
}

I ran your code as is and it took 47 seconds, ten seconds slower. I then changed:

```_mm_prefetch(&bytevecM[i], _MM_HINT_T0);
_mm_prefetch(&bytevecM[i^64], _MM_HINT_T0);
back to my original:
```_mm_prefetch(&bytevecM[(unsigned int)(i) & 0xffffff80], _MM_HINT_T0);
_mm_prefetch(&bytevecM[64+((unsigned int)(i) & 0xffffff80)], _MM_HINT_
+T0);
and it ran in 38 seconds, only one second slower. Note that the &0xffffff80 aligns on a 64 byte boundary while ensuring we get the two 64 byte cache lines required for the inner loop.

I profiled with VTune and both my (37 second) and your (38 second) solution showed up as having two (seven second) hotspots -- presumably due to memory latency -- in the same places, namely here:

```; 100  :         UNROLL(q8)

0x1400028e0      Block 178:
0x1400028e0      mov eax, r9d                      7.217s
0x1400028e3      xor rax, rdi                      0.060s
0x1400028e6      movzx r10d, byte ptr [rax+rsi*1]  0.100s
0x1400028eb      test r10d, r10d                   2.508s
0x1400028ee      jz 0x140002a0b <Block 192>
and here:
```; 99   :        for (q8 = 14; q8 < 128; ++q8) {

0x140002a0b     Block 192:
0x140002a0b     inc r9d                            7.008s
0x140002a0e     cmp r9d, 0x80                      0.690s
0x140002a15     jl 0x1400028e0 <Block 178>

Well, this is curious. Intel reference has this about prefetchtx:

Fetches the line of data from memory that contains the byte specified with the source operand to a location in the cache hierarchy specified by locality hint.
There's no need to align the prefetch pointer. Be sure to align the data records themselves, of course.

I run a little pointer-chasing bench (on Nehalem). The optimum appears to be fetching ~16 links ahead. But this is just an empirical point. You could try increasing the prefetch distance.

There's a LOAD_HIT_PRE event that indicates too-late prefetches, might try that. Also, it helps to see clocks together with UOPS_RETIRED (or INST_RETIRED), to see whether it does a lot of work or a lot of stalling. Branch mispredictions may also show up there.

Update.

One article gives these figures for Haswell: 10 line fill buffers, 16 outstanding L2 misses. Prefetch hints that can't be tracked are simply dropped. There are also hardware prefetcher units that detect "streams" of consecutive requests in same direction. So yes, the order of memory accesses (prefetches too?) can make a difference.

Intel has some docs on tuning. Your loop could be improved in many ways, but don't get carried away. Figure out how you can lookup q8+q9 together, eliminating two inner loops.

Create A New User
Node Status?
node history
Node Type: note [id://1086076]
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (6)
As of 2017-11-23 14:33 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
In order to be able to say "I know Perl", you must have:

Results (336 votes). Check out past polls.

Notices?