PerlMonks  

Re: The 10**21 Problem (Part 3)

by oiskuu (Friar)
on May 12, 2014 at 21:26 UTC (#1085841)


in reply to The 10**21 Problem (Part 3)

A few points, if I may:

  • On modern Linux configured with transparent hugepages, you (likely) get hugepages for big segments without any special code or handling.
  • Your first "conservative unrolling" is more accurately called loop peeling. The second form is complete loop peeling. To unroll a loop, you merge N copies of the loop body and iterate: for (i = 0; i < len; i += N) { BODY(i); BODY(i+1); ... }
  • I expect very little benefit from unrolling the complex inner loop. (But I'd certainly test a 2x unroll; sometimes weird effects are at work.)
  • Random memory access with hugepages still costs on the order of 200 clocks. That time would be well spent on calculations. I'd therefore start with a few iterations of type a (= calculated), then switch over to type b (= lookup) once the prefetch has completed. Either loop might be unrolled.
  • Better still: prefetch the data needed for the next loop iteration, that is vecM[m7next]. It will then be ready at the right time. (*)
  • Unrolling benefits may not carry over once hyperthreading is accounted for.
  • The hashing step plus modulus involves a chain of 3 muls, meaning latency > 3*4 + some. Loop iterations may overlap, except that a quarter of the tests against 500 ± 127 will mispredict. This suggests another option: calculate a temporary q8_to_d9[128] in a tight loop (which allows good overlap), followed by a loop with the tests.
  • The q8_to_d9[] is a good candidate for vectorization. So is q9_to_c9[]. In between, grep from d9 to q9. I think vectorized code will beat the lookup variant. Or not. Lookup plus vectorized q9_to_c9 might be better. (*)
  • Test with different compilers: gcc, clang.

All in all, this problem seems like a nice little assignment for a competent cryptanalyst.

Modulo compares. Since (2**31) % 1000003 == 477207, we know we may safely add small values to the hashval without fear of sign change or overflow. This allows one to rewrite tests such as
    v %= 1001; if (v < 0) v += 1001; if (v != 100) continue;
in a simpler, speedier form:
    v += 1001 - 100; if (v % 1001) continue;

(*) Updated.


Re^2: The 10**21 Problem (Part 3)
by eyepopslikeamosquito (Canon) on May 14, 2014 at 20:07 UTC

    Thanks for the great ideas to try. I suspect you work in this domain in your day job, is that right? In any case, you seem to have a lot more experience in HPC than I do.

    I was able to quickly test one of your excellent tips:

    Better still: prefetch the data needed for next loop iteration, that is vecM[m7next]. It will then be ready at the right time.
    Simply changing the inner loop from:
    for (q7 = 1; q7 < 128; ++q7) {
        if (q7 == 10 || q7 == 13) continue;
        m7 = (m6 ^ q7) * H_PRIME;
        s7 = _mm_mullo_epi32(_mm_xor_si128(s6, _mm_set1_epi32(q7)), hp);
        _mm_prefetch(&bytevecM[(unsigned int)m7 & 0xffffff80], _MM_HINT_T0);
        _mm_prefetch(&bytevecM[64 + ((unsigned int)m7 & 0xffffff80)], _MM_HINT_T0);
        UNROLL(1) UNROLL(2) UNROLL(3) UNROLL(4) UNROLL(5) UNROLL(6)
        UNROLL(7) UNROLL(8) UNROLL(9) UNROLL(11) UNROLL(12)
        for (q8 = 14; q8 < 128; ++q8) {
            UNROLL(q8)
        }
    }
    to two inner loops:
    for (q7 = 1; q7 < 128; ++q7) {
        if (q7 == 10 || q7 == 13) continue;
        m7 = (m6 ^ q7) * H_PRIME;
        m7arr[q7] = m7;
        _mm_prefetch(&bytevecM[(unsigned int)m7 & 0xffffff80], _MM_HINT_T0);
        _mm_prefetch(&bytevecM[64 + ((unsigned int)m7 & 0xffffff80)], _MM_HINT_T0);
    }
    for (q7 = 1; q7 < 128; ++q7) {
        if (q7 == 10 || q7 == 13) continue;
        m7 = m7arr[q7];
        s7 = _mm_mullo_epi32(_mm_xor_si128(s6, _mm_set1_epi32(q7)), hp);
        UNROLL(1) UNROLL(2) UNROLL(3) UNROLL(4) UNROLL(5) UNROLL(6)
        UNROLL(7) UNROLL(8) UNROLL(9) UNROLL(11) UNROLL(12)
        for (q8 = 14; q8 < 128; ++q8) {
            UNROLL(q8)
        }
    }
    reduced the running time by 11 seconds, from 48 seconds to 37 seconds (371 to 286 years). The cost of the extra loop and storing 128 m7 values in an array is well worth it to reduce memory latency.

    Update: This variation was 43 seconds, six seconds slower:

    for (q7 = 1; q7 < 128; ++q7) {
        if (q7 == 10 || q7 == 13) continue;
        m7 = (m6 ^ q7) * H_PRIME;
        m7arr[q7] = m7;
        _mm_prefetch(&bytevecM[(unsigned int)m7 & 0xffffff80], _MM_HINT_T0);
        _mm_prefetch(&bytevecM[64 + ((unsigned int)m7 & 0xffffff80)], _MM_HINT_T0);
        s7 = _mm_mullo_epi32(_mm_xor_si128(s6, _mm_set1_epi32(q7)), hp);
        s7arr[q7] = s7;
    }
    for (q7 = 1; q7 < 128; ++q7) {
        if (q7 == 10 || q7 == 13) continue;
        m7 = m7arr[q7];
        s7 = s7arr[q7];
        // ... same UNROLL as before
    }

      Prefetching is tricky. I wouldn't try so many of them together. The number of outstanding requests is limited. How about this:

      // candidates for vectorization, let's break them apart
      static inline void q7_to_m7(int m7[], int m6)
      {
          int q, m6x = m6 ^ 1;
          for (q = 0; q < 128; q += 2) {   // unroll by 2
              m7[q]   = (m6  ^ q) * H_PRIME;
              m7[q+1] = (m6x ^ q) * H_PRIME;
          }
      }
      static inline void prefetch_m(unsigned int i)
      {
          _mm_prefetch(&bytevecM[i], _MM_HINT_T0);
          _mm_prefetch(&bytevecM[i^64], _MM_HINT_T0);
      }
      ...
      prefetch_m((m6^1) * H_PRIME);
      int m7arr[130];
      q7_to_m7(m7arr, m6);
      // fixup for prefetching two iterations ahead
      m7arr[129] = m7arr[128] = m7arr[127];
      m7arr[13] = m7arr[15];
      m7arr[10] = m7arr[12];
      prefetch_m(m7arr[2]);
      for (q7 = 1; q7 < 128; ++q7) {
          if (q7 == 10 || q7 == 13) continue;
          prefetch_m(m7arr[q7+2]);
          m7 = m7arr[q7];
          ...
      }

        I ran your code as is and it took 47 seconds, ten seconds slower. I then changed:

        _mm_prefetch(&bytevecM[i], _MM_HINT_T0);
        _mm_prefetch(&bytevecM[i^64], _MM_HINT_T0);
        back to my original:
        _mm_prefetch(&bytevecM[(unsigned int)(i) & 0xffffff80], _MM_HINT_T0);
        _mm_prefetch(&bytevecM[64 + ((unsigned int)(i) & 0xffffff80)], _MM_HINT_T0);
        and it ran in 38 seconds, only one second slower. Note that the & 0xffffff80 rounds down to a 128-byte boundary (and is therefore also 64-byte aligned), ensuring we get the two 64-byte cache lines required by the inner loop.

        I profiled with VTune and both my (37 second) and your (38 second) solution showed up as having two (seven second) hotspots -- presumably due to memory latency -- in the same places, namely here:

        ; 100 :  UNROLL(q8)
        0x1400028e0 Block 178:
        0x1400028e0   mov eax, r9d                        7.217s
        0x1400028e3   xor rax, rdi                        0.060s
        0x1400028e6   movzx r10d, byte ptr [rax+rsi*1]    0.100s
        0x1400028eb   test r10d, r10d                     2.508s
        0x1400028ee   jz 0x140002a0b <Block 192>
        and here:
        ; 99 :  for (q8 = 14; q8 < 128; ++q8) {
        0x140002a0b Block 192:
        0x140002a0b   inc r9d          7.008s
        0x140002a0e   cmp r9d, 0x80    0.690s
        0x140002a15   jl 0x1400028e0 <Block 178>
