The 10**21 Problem (Part 3)

by eyepopslikeamosquito (Canon)
on May 11, 2014 at 13:04 UTC

To recap, in parts one and two we reduced the estimated running time of a magic formula search program from 52,887,477 years down to just 728 years. Though that was gratifying, I really didn't want to wait that long. I was happy to wait a year or two, but no more. So I had to find a way to somehow make the search program run a hundred times faster.

That's not a minor disadvantage, we're talking about things being 50 to 100 times slower with a linked list. Compactness matters. Vectors are more compact than lists. And predictable usage patterns matter enormously. With a vector, you have to shove a lot of elements over, but caches are really really good at that.

When you traverse lists you keep doing random access ... so you are actually random accessing your memory and you are maximizing your cache misses, which is exactly the opposite of what you want.

-- Bjarne Stroustrup: Why you should avoid Linked Lists (3:30-5:00) (from GoingNative 2012 Keynote)

At the end of Part 2, you may recall that we had become stalled by a nasty "cache miss" performance problem, similar to the one Stroustrup describes above. By following Stroustrup's advice to "stay compact, stay predictable" -- and without changing the essential algorithm -- I was able to eventually achieve the massive speed up required.

Sometimes you think a code change "should" improve performance yet it does not. Each specific code "performance improvement" therefore has to be benchmarked separately, accepted if it improves performance, backed out if not. Painstaking work. One step at a time ... one insight leads to another ... just like playing code golf. :-)

Lookup Table vs Calculation

In Part 1 we noticed that, when playing golf, there is often a battle between a lookup table and a magic formula. Curiously, when searching for a code golf magic formula, there is often a similar battle, this time between a lookup table and a calculation: for example, to eliminate the inner loop, ambrus used a calculation, while I used a lookup table.

When searching for our magic formula, lookup table wins for the first Roman Numeral value, it's a tie for the second value, and calculation wins for all subsequent values. Why?

Well, from Part 2, the code to calculate the first Roman Numeral value (D):

d8 = (d7 ^ q8) * H_PRIME ^ HASH_LEN;
d9 = d8 & ~127;
d9 %= 1001;
if (d9 < 0) d9 += 1001;
if (d9 < 500 - 127) continue;
if (d9 > 500 + 127) continue;
dist = d9 < 500 ? 500 - d9 : d9 - 500;
q9 = (d8 & 127) ^ dist;
if (q9 == 0 || q9 == 10 || q9 == 13) continue;
is significantly more work than the code to calculate the second and subsequent values namely:
c9 = (c7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
c9 %= 1001;
if (c9 < 0) c9 += 1001;
if (c9 != 100) continue;
Apart from being more work, the first value always needs to be calculated, while the second value need only be calculated about 255 times per 1001 (i.e. when d9 above is in the range 373..627). Since the second value matches only once per 1001 (i.e. when c9 above is precisely 100), the third value need only be calculated with a probability of (255/1001) * (1/1001), the fourth value (255/1001) * (1/1001) * (1/1001), and so on. Testing showed that hitting the third and subsequent values (i.e. L, X, M) is a sufficiently rare event that, with current commodity hardware, it is not worth cluttering all the memory caches with their 4GB lookup tables.

Of course, whether lookup table is faster than calculation depends on CPU versus memory speed. And main memory speed can be improved by spending a lot of money. At one stage, I was considering boosting memory performance by buying expensive high speed memory and overclocking my PC.

By the way, a nice bonus from reducing five 4GB lookup tables down to just one is that the program's total memory requirement is now much lower, allowing us to run it in certain massively parallel environments, such as Xeon Phi for example (which currently has a 6GB limit per core).

Here is the inner loop code revised to use just one 4GB lookup table:

#define H_PRIME 1000003
#define HASH_LEN 11

// my_m128_t mimics Intel __m128i type
typedef struct my_m128_t my_m128_t;
struct my_m128_t { short m128i_i16[8]; };

int inner(
   const unsigned char* bytevecM,
   int startval, int endval,
   int m2, int d2, int c2, int l2, int x2,
   my_m128_t* ps)
{
   int m3, m4, m5, m6, m7;
   int d3, d4, d5, d6, d7, d9;
   int c3, c4, c5, c6, c7, c9;
   int l3, l4, l5, l6, l7, l9;
   int x3, x4, x5, x6, x7, x9;
   int q3, q4, q5, q6, q7, q8, q9;
   int iret = 0;

   for (q3 = startval; q3 < endval; ++q3) {
      if (q3 == 10 || q3 == 13) continue;
      m3 = (m2 ^ q3) * H_PRIME;
      d3 = (d2 ^ q3) * H_PRIME;
      c3 = (c2 ^ q3) * H_PRIME;
      l3 = (l2 ^ q3) * H_PRIME;
      x3 = (x2 ^ q3) * H_PRIME;
      for (q4 = 1; q4 < 128; ++q4) {
         if (q4 == 10 || q4 == 13) continue;
         m4 = (m3 ^ q4) * H_PRIME;
         d4 = (d3 ^ q4) * H_PRIME;
         c4 = (c3 ^ q4) * H_PRIME;
         l4 = (l3 ^ q4) * H_PRIME;
         x4 = (x3 ^ q4) * H_PRIME;
         for (q5 = 1; q5 < 128; ++q5) {
            if (q5 == 10 || q5 == 13) continue;
            m5 = (m4 ^ q5) * H_PRIME;
            d5 = (d4 ^ q5) * H_PRIME;
            c5 = (c4 ^ q5) * H_PRIME;
            l5 = (l4 ^ q5) * H_PRIME;
            x5 = (x4 ^ q5) * H_PRIME;
            for (q6 = 1; q6 < 128; ++q6) {
               if (q6 == 10 || q6 == 13) continue;
               m6 = (m5 ^ q6) * H_PRIME;
               d6 = (d5 ^ q6) * H_PRIME;
               c6 = (c5 ^ q6) * H_PRIME;
               l6 = (l5 ^ q6) * H_PRIME;
               x6 = (x5 ^ q6) * H_PRIME;
               for (q7 = 1; q7 < 128; ++q7) {
                  if (q7 == 10 || q7 == 13) continue;
                  m7 = (m6 ^ q7) * H_PRIME;
                  d7 = (d6 ^ q7) * H_PRIME;
                  c7 = (c6 ^ q7) * H_PRIME;
                  l7 = (l6 ^ q7) * H_PRIME;
                  x7 = (x6 ^ q7) * H_PRIME;
                  for (q8 = 1; q8 < 128; ++q8) {
                     if (q8 == 10 || q8 == 13) continue;
                     q9 = bytevecM[(unsigned int)(m7 ^ q8)];
                     if (q9 == 0) continue;
                     d9 = (d7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                     d9 %= 1001; if (d9 < 0) d9 += 1001;
                     if (d9 != 500) continue;
                     c9 = (c7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                     c9 %= 1001; if (c9 < 0) c9 += 1001;
                     if (c9 != 100) continue;
                     l9 = (l7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                     l9 %= 1001; if (l9 < 0) l9 += 1001;
                     if (l9 != 50) continue;
                     x9 = (x7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                     x9 %= 1001; if (x9 < 0) x9 += 1001;
                     if (x9 != 10) continue;
                     ps[iret].m128i_i16[0] = q3;
                     ps[iret].m128i_i16[1] = q4;
                     ps[iret].m128i_i16[2] = q5;
                     ps[iret].m128i_i16[3] = q6;
                     ps[iret].m128i_i16[4] = q7;
                     ps[iret].m128i_i16[5] = q8;
                     ps[iret].m128i_i16[6] = q9;
                     ++iret;
                  }
               }
            }
         }
      }
   }
   return iret;
}
As you can see, we simply combined the "first theorem of ambrus" code and the "first hack of eyepopslikeamosquito" code from Part 2.

Doing this reduced the running time from 95 to 74 seconds = 74 * 125**4 / 60 / 60 / 24 / 365 = 572 years.

Prefetching

In emerging and future high-end processor systems, tolerating increasing cache miss latency and properly managing memory bandwidth will be critical to achieving high performance. Prefetching, in both hardware and software, is among our most important available techniques for doing so.

-- When Prefetching Works, When it Doesn't, and Why (pdf)

What more can be done to speed up our (newly single) 4GB lookup table? Well, before we enter the inner loop, we already know the limited range of values required from the bytevecM lookup table. In fact, just as in ambrus's first theorem, there are only 128 values, and they are all right next to each other in memory. To reduce memory latency, can we somehow tell the computer to start the tortuous load sequence of those 128 bytes from main memory before we enter the inner loop? Yes! The MSVC _mm_prefetch compiler intrinsic, which under the covers generates the x86 SSE SIMD prefetcht0 instruction, does just that. That is, adding the following code immediately before the q8 loop above:

_mm_prefetch(&bytevecM[(unsigned int)m7 & 0xffffff80], _MM_HINT_T0);
_mm_prefetch(&bytevecM[64+((unsigned int)m7 & 0xffffff80)], _MM_HINT_T0);
begins bringing into all CPU caches the two (64-byte) cache lines required by the q8 loop for the bytevecM lookup table before the loop starts. Note that 0xffffff80 above is just ~127 in disguise.

Adding this prefetch hack reduced the running time by 6 seconds to 68 seconds = 68 * 125**4 / 60 / 60 / 24 / 365 = 526 years. Update: this was later improved from 6 to 17 seconds.

TLB Performance

There are a couple of factors which influence TLB performance. The first is the size of the pages. Obviously, the larger a page is, the more instructions or data objects will fit into it. So a larger page size reduces the overall number of address translations which are needed, meaning that fewer entries in the TLB cache are needed. Most architectures nowadays allow the use of multiple different page sizes; some sizes can be used concurrently. For instance, the x86/x86-64 processors have a normal page size of 4kB but they can also use 4MB and 2MB pages respectively.

-- Ulrich Drepper

Larger page sizes mean that a TLB cache of the same size can keep track of larger amounts of memory, which avoids costly TLB misses.

-- Page (wikipedia)

As indicated above, there is a simple way to significantly boost TLB performance: just use large pages!

So, I allocated the 4GB lookup table with large pages on my (Windows 8) Operating System via the following Win32 code, switching from malloc to my_big_malloc below. Note that running this code requires Administrator privileges.

BOOL MySetLockPagesPrivilege(HANDLE hProcess, BOOL bEnable) {
   struct {
      DWORD Count;
      LUID_AND_ATTRIBUTES Privilege[1];
   } Info;
   HANDLE Token;
   if ( !OpenProcessToken(hProcess, TOKEN_ADJUST_PRIVILEGES, &Token) ) {
      printf("Cannot open process token.\n");
      return FALSE;
   }
   Info.Count = 1;
   if (bEnable) {
      Info.Privilege[0].Attributes = SE_PRIVILEGE_ENABLED;
   }
   else {
      Info.Privilege[0].Attributes = 0;
   }
   if ( !LookupPrivilegeValue(NULL, SE_LOCK_MEMORY_NAME, &(Info.Privilege[0].Luid)) ) {
      printf("Cannot get privilege for %s.\n", SE_LOCK_MEMORY_NAME);
      return FALSE;
   }
   if ( !AdjustTokenPrivileges(Token, FALSE, (PTOKEN_PRIVILEGES)&Info, 0, NULL, NULL) ) {
      printf("Cannot adjust token privileges (%u)\n", GetLastError());
      return FALSE;
   }
   // ERROR_NOT_ALL_ASSIGNED 1300 (0x514):
   // Not all privileges or groups referenced are assigned to the caller.
   DWORD dwErr = GetLastError();
   printf("AdjustTokenPrivileges dwErr=%u\n", dwErr);
   if (dwErr != ERROR_SUCCESS) {
      printf("Cannot enable the SE_LOCK_MEMORY_NAME privilege");
      return FALSE;
   }
   CloseHandle(Token);
   return TRUE;
}

void LargeMemInit() {
   if ( !MySetLockPagesPrivilege(GetCurrentProcess(), TRUE) ) {
      fprintf(stderr, "MySetLockPagesPrivilege failed.\n");
      exit(1);
   }
}

// VirtualAlloc may fail with ERROR_NO_SYSTEM_RESOURCES = 1450. To fix, reboot.
static void* my_big_malloc(size_t s) {
   DWORD dwErr;
   void* pMem;
   pMem = VirtualAlloc(
      NULL,       // let the system decide where to allocate it
      s,          // size of memory block
      MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
      PAGE_READWRITE );
   if (pMem == NULL) {
      // Error 1314 = ERROR_PRIVILEGE_NOT_HELD
      dwErr = GetLastError();
      fprintf(stderr, "VirtualAlloc failed: dwError=%d\n", (int)dwErr);
      exit(1);
   }
   printf("VirtualAlloc ok p=%p\n", pMem);
   return pMem;
}

Note that SeLockMemoryPrivilege ("Lock pages in memory") is required to use large pages. Also, obtaining 4GB of contiguous space may fail after the system has been running for a while due to memory fragmentation (to work around this, I typically fired up my search program right after a reboot).

For Oracle Databases, using HugePages reduces the operating system maintenance of page states, and increases Translation Lookaside Buffer (TLB) hit ratio.

-- Oracle Database Administrator's Reference (Unix and Linux)

You can similarly increase the page size on Linux, as indicated in the Oracle documentation above. In my googling on this topic, I discovered that all the big database vendors have an option to use Large/Huge pages, indicating that this is an important performance optimization, especially when addressing large amounts of memory.

oiskuu kindly messaged me this interesting research paper (pdf) describing how to hack direct memory segments into the kernel. So further TLB performance improvements are certainly possible.

Memory alignment can also impact performance, so when in doubt, I aligned memory on a 64 byte boundary.

The straightforward large page change above reduced the running time by a further 10 seconds, down to 58 seconds = 58 * 125**4 / 60 / 60 / 24 / 365 = 449 years.

Vectorization

Intel Advanced Vector Extensions (Intel AVX) is a set of instructions for doing Single Instruction Multiple Data (SIMD) operations on Intel architecture CPUs

-- Introduction to Intel Advanced Vector Extensions

As a further optimization, I applied some MSVC AVX compiler intrinsics, namely _mm_set1_epi32, _mm_xor_si128 and _mm_mullo_epi32, as shown below:

#include <emmintrin.h>
#include <smmintrin.h>

#define H_PRIME 1000003
#define HASH_LEN 11

// my_m128_t mimics Intel __m128i type
typedef struct my_m128_t my_m128_t;
struct my_m128_t { short m128i_i16[8]; };

int inner(
   const unsigned char* bytevecM,
   int startval, int endval,
   int m2, int d2, int c2, int l2, int x2,
   my_m128_t* ps)
{
   __m128i s2 = _mm_set_epi32(x2, l2, c2, d2);
   __m128i hp = _mm_set1_epi32(H_PRIME);
   __m128i s3, s4, s5, s6, s7;
   int m3, m4, m5, m6, m7;
   int d9;
   int c9;
   int l9;
   int x9;
   int q3, q4, q5, q6, q7, q8, q9;
   int iret = 0;

   for (q3 = startval; q3 < endval; ++q3) {
      if (q3 == 10 || q3 == 13) continue;
      m3 = (m2 ^ q3) * H_PRIME;
      s3 = _mm_mullo_epi32(_mm_xor_si128(s2, _mm_set1_epi32(q3)), hp);
      for (q4 = 1; q4 < 128; ++q4) {
         if (q4 == 10 || q4 == 13) continue;
         m4 = (m3 ^ q4) * H_PRIME;
         s4 = _mm_mullo_epi32(_mm_xor_si128(s3, _mm_set1_epi32(q4)), hp);
         for (q5 = 1; q5 < 128; ++q5) {
            if (q5 == 10 || q5 == 13) continue;
            m5 = (m4 ^ q5) * H_PRIME;
            s5 = _mm_mullo_epi32(_mm_xor_si128(s4, _mm_set1_epi32(q5)), hp);
            for (q6 = 1; q6 < 128; ++q6) {
               if (q6 == 10 || q6 == 13) continue;
               m6 = (m5 ^ q6) * H_PRIME;
               s6 = _mm_mullo_epi32(_mm_xor_si128(s5, _mm_set1_epi32(q6)), hp);
               for (q7 = 1; q7 < 128; ++q7) {
                  if (q7 == 10 || q7 == 13) continue;
                  m7 = (m6 ^ q7) * H_PRIME;
                  s7 = _mm_mullo_epi32(_mm_xor_si128(s6, _mm_set1_epi32(q7)), hp);
                  _mm_prefetch(&bytevecM[(unsigned int)m7 & 0xffffff80], _MM_HINT_T0);
                  _mm_prefetch(&bytevecM[64+((unsigned int)m7 & 0xffffff80)], _MM_HINT_T0);
                  for (q8 = 1; q8 < 128; ++q8) {
                     if (q8 == 10 || q8 == 13) continue;
                     q9 = bytevecM[(unsigned int)(m7 ^ q8)];
                     if (q9 == 0) continue;
                     d9 = (s7.m128i_i32[0] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                     d9 %= 1001; if (d9 < 0) d9 += 1001;
                     if (d9 != 500) continue;
                     c9 = (s7.m128i_i32[1] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                     c9 %= 1001; if (c9 < 0) c9 += 1001;
                     if (c9 != 100) continue;
                     l9 = (s7.m128i_i32[2] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                     l9 %= 1001; if (l9 < 0) l9 += 1001;
                     if (l9 != 50) continue;
                     x9 = (s7.m128i_i32[3] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                     x9 %= 1001; if (x9 < 0) x9 += 1001;
                     if (x9 != 10) continue;
                     ps[iret].m128i_i16[0] = q3;
                     ps[iret].m128i_i16[1] = q4;
                     ps[iret].m128i_i16[2] = q5;
                     ps[iret].m128i_i16[3] = q6;
                     ps[iret].m128i_i16[4] = q7;
                     ps[iret].m128i_i16[5] = q8;
                     ps[iret].m128i_i16[6] = q9;
                     ++iret;
                  }
               }
            }
         }
      }
   }
   return iret;
}
Examination of the generated assembler shows that application of these compiler intrinsics replaced four separate xor and imul instructions on 32-bit registers with one vpxor and one vpmulld instruction that operate simultaneously on four 32-bit values in a 128-bit XMM register.

This data level parallelism gained a further three seconds, reducing the estimated running time to 425 years.

Loop unwinding

Loop unwinding, also known as loop unrolling, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size (space-time tradeoff).

The goal of loop unwinding is to increase a program's speed by reducing (or eliminating) instructions that control the loop, such as pointer arithmetic and "end of loop" tests on each iteration; reducing branch penalties; as well as "hiding latencies, in particular, the delay in reading data from memory". To eliminate this overhead, loops can be re-written as a repeated sequence of similar independent statements.

-- Loop unwinding (wikipedia)

As a further optimization, I performed the tedious yet important "inner loop unwinding optimization" by defining:

#define UNROLL(qx) \
   q9 = bytevecM[(unsigned int)(m7 ^ qx)]; \
   if (q9 != 0) { \
      d9 = (s7.m128i_i32[0] ^ qx) * H_PRIME ^ HASH_LEN ^ q9; \
      d9 %= 1001; if (d9 < 0) d9 += 1001; \
      if (d9 == 500) { \
         c9 = (s7.m128i_i32[1] ^ qx) * H_PRIME ^ HASH_LEN ^ q9; \
         c9 %= 1001; if (c9 < 0) c9 += 1001; \
         if (c9 == 100) { \
            l9 = (s7.m128i_i32[2] ^ qx) * H_PRIME ^ HASH_LEN ^ q9; \
            l9 %= 1001; if (l9 < 0) l9 += 1001; \
            if (l9 == 50) { \
               x9 = (s7.m128i_i32[3] ^ qx) * H_PRIME ^ HASH_LEN ^ q9; \
               x9 %= 1001; if (x9 < 0) x9 += 1001; \
               if (x9 == 10) { \
                  ps[iret].m128i_i16[0] = q3; \
                  ps[iret].m128i_i16[1] = q4; \
                  ps[iret].m128i_i16[2] = q5; \
                  ps[iret].m128i_i16[3] = q6; \
                  ps[iret].m128i_i16[4] = q7; \
                  ps[iret].m128i_i16[5] = qx; \
                  ps[iret].m128i_i16[6] = q9; \
                  ++iret; \
               } \
            } \
         } \
      } \
   }
then adjusting the inner loop from:
for (q8 = 1; q8 < 128; ++q8) {
   if (q8 == 10 || q8 == 13) continue;
   q9 = bytevecM[(unsigned int)(m7 ^ q8)];
   if (q9 == 0) continue;
   d9 = (s7.m128i_i32[0] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
   d9 %= 1001; if (d9 < 0) d9 += 1001;
   if (d9 != 500) continue;
   c9 = (s7.m128i_i32[1] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
   c9 %= 1001; if (c9 < 0) c9 += 1001;
   if (c9 != 100) continue;
   l9 = (s7.m128i_i32[2] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
   l9 %= 1001; if (l9 < 0) l9 += 1001;
   if (l9 != 50) continue;
   x9 = (s7.m128i_i32[3] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
   x9 %= 1001; if (x9 < 0) x9 += 1001;
   if (x9 != 10) continue;
   ps[iret].m128i_i16[0] = q3;
   ps[iret].m128i_i16[1] = q4;
   ps[iret].m128i_i16[2] = q5;
   ps[iret].m128i_i16[3] = q6;
   ps[iret].m128i_i16[4] = q7;
   ps[iret].m128i_i16[5] = q8;
   ps[iret].m128i_i16[6] = q9;
   ++iret;
}
to first, the conservative:
UNROLL(1) UNROLL(2) UNROLL(3) UNROLL(4) UNROLL(5)
UNROLL(6) UNROLL(7) UNROLL(8) UNROLL(9)
UNROLL(11) UNROLL(12)
for (q8 = 14; q8 < 128; ++q8) {
   UNROLL(q8)
}
and then the more radical:
UNROLL(1) UNROLL(2) UNROLL(3) ... UNROLL(127)
You cannot know which variation will be fastest without benchmarking all three. The more you unroll, the bigger the code, which puts pressure on the instruction cache; if the code still fits, though, it runs faster because the repeated loop tests are eliminated.

The conservative unwinding above gained seven seconds, while the radical one actually made the program slower.

To summarize then, our estimated running time was reduced in this node from 728 years down to 371 years (update: 286 years now). Though making such an improvement was certainly pleasurable, that is still nowhere near fast enough. How to find that elusive order of magnitude speed improvement? Find out in the next (and final) episode.


Update 12-May: Minor improvements to wording. Minor changes to code: removed unused bytevecD, bytevecC, bytevecL, bytevecX arguments to inner() function; added (unsigned int) cast to m7 in _mm_prefetch() call. Update 16-May: improved final running time from 371 to 286 years.

Re: The 10**21 Problem (Part 3)
by oiskuu (Pilgrim) on May 12, 2014 at 21:26 UTC

    A few points, if I may:

    • On modern linux configured with transparent hugepages, you (likely) get hugepages for big segments without any special code or handling.
    • Your 1st conservative unrolling is more accurately called loop peeling. The second form is complete loop peeling. To unroll a loop you merge N*loop bodies and iterate for (i=0; i<len; i += N) { BODY(i); BODY(i+1); ... }
    • I expect very little benefits from unrolling the complex inner loop. (But certainly I'd test a 2x unroll. Sometimes weird effects may be at work.)
    • Random memory access with hugepages is still on the order of 200 clocks or so. This time would be well spent on calculations. I'd therefore start with a few iterations of type a (=calculated), then switch over to type b (=lookup) once the prefetch has completed. Either loop might be unrolled.
    • Better still: prefetch the data needed for next loop iteration, that is vecM[m7next]. It will then be ready at the right time. (*)
    • Unrolling benefits may not carry over when hyperthreading is accounted for.
    • The hashing step+modulus involves a chain of 3 muls, meaning latency > 3*4+some. Loop iterations may overlap, except quarter of the tests vs 500+-127 will mispredict. This suggests another option of calculating a temporary q8_to_d9[128] in a tight loop (that allows good overlap), followed by a loop with the tests.
    • The q8_to_d9[] is a good candidate for vectorization. So is q9_to_c9[]. Inbetween, grep from d9 to q9. I think vectorized code will beat the lookup variant. Or not. Lookup+vectorized q9_to_c9 might be better. (*)
    • Test with different compilers. gcc, clang.

    All in all, this problem seems like a nice little assignment for a competent cryptanalyst.

    Modulo cmps. Since (2**31)%1000003 == 477207, we know we may safely add small values to the hashval without fear of sign change or overflow. This allows one to rewrite tests such as
        v %= 1001; if (v < 0) v += 1001; if (v != 100) continue;
    in a simpler, speedier form:
        v += 1001 - 100; if (v % 1001) continue;

    (*) Updated.

      Thanks for the great ideas to try. I suspect you work in this domain in your day job, is that right? In any case, you seem to have a lot more experience in HPC than me.

      I was able to quickly test one of your excellent tips:

      Better still: prefetch the data needed for next loop iteration, that is vecM[m7next]. It will then be ready at the right time.
      Simply changing the inner loop from:
for (q7 = 1; q7 < 128; ++q7) {
   if (q7 == 10 || q7 == 13) continue;
   m7 = (m6 ^ q7) * H_PRIME;
   s7 = _mm_mullo_epi32(_mm_xor_si128(s6, _mm_set1_epi32(q7)), hp);
   _mm_prefetch(&bytevecM[(unsigned int)m7 & 0xffffff80], _MM_HINT_T0);
   _mm_prefetch(&bytevecM[64+((unsigned int)m7 & 0xffffff80)], _MM_HINT_T0);
   UNROLL(1) UNROLL(2) UNROLL(3) UNROLL(4) UNROLL(5)
   UNROLL(6) UNROLL(7) UNROLL(8) UNROLL(9)
   UNROLL(11) UNROLL(12)
   for (q8 = 14; q8 < 128; ++q8) {
      UNROLL(q8)
   }
}
      to two inner loops:
for (q7 = 1; q7 < 128; ++q7) {
   if (q7 == 10 || q7 == 13) continue;
   m7 = (m6 ^ q7) * H_PRIME;
   m7arr[q7] = m7;
   _mm_prefetch(&bytevecM[(unsigned int)m7 & 0xffffff80], _MM_HINT_T0);
   _mm_prefetch(&bytevecM[64+((unsigned int)m7 & 0xffffff80)], _MM_HINT_T0);
}
for (q7 = 1; q7 < 128; ++q7) {
   if (q7 == 10 || q7 == 13) continue;
   m7 = m7arr[q7];
   s7 = _mm_mullo_epi32(_mm_xor_si128(s6, _mm_set1_epi32(q7)), hp);
   UNROLL(1) UNROLL(2) UNROLL(3) UNROLL(4) UNROLL(5)
   UNROLL(6) UNROLL(7) UNROLL(8) UNROLL(9)
   UNROLL(11) UNROLL(12)
   for (q8 = 14; q8 < 128; ++q8) {
      UNROLL(q8)
   }
}
      reduced the running time by 11 seconds, from 48 seconds to 37 seconds (371 to 286 years). The cost of the extra loop and storing 128 m7 values in an array is well worth it to reduce memory latency.

      Update: This variation was 43 seconds, six seconds slower:

for (q7 = 1; q7 < 128; ++q7) {
   if (q7 == 10 || q7 == 13) continue;
   m7 = (m6 ^ q7) * H_PRIME;
   m7arr[q7] = m7;
   _mm_prefetch(&bytevecM[(unsigned int)m7 & 0xffffff80], _MM_HINT_T0);
   _mm_prefetch(&bytevecM[64+((unsigned int)m7 & 0xffffff80)], _MM_HINT_T0);
   s7 = _mm_mullo_epi32(_mm_xor_si128(s6, _mm_set1_epi32(q7)), hp);
   s7arr[q7] = s7;
}
for (q7 = 1; q7 < 128; ++q7) {
   if (q7 == 10 || q7 == 13) continue;
   m7 = m7arr[q7];
   s7 = s7arr[q7];
   // ... same UNROLL as before
}

        Prefetching is tricky. I wouldn't try so many of them together. The number of outstanding requests is limited. How about this:

// candidates for vectorization, lets break them apart
static inline void q7_to_m7(int m7[], int m6)
{
   int q, m6x = m6 ^ 1;
   for (q = 0; q < 128; q += 2) {   // unroll by 2
      m7[q] = (m6 ^ q) * H_PRIME;
      m7[q+1] = (m6x ^ q) * H_PRIME;
   }
}

static inline void prefetch_m(unsigned int i)
{
   _mm_prefetch(&bytevecM[i], _MM_HINT_T0);
   _mm_prefetch(&bytevecM[i^64], _MM_HINT_T0);
}
...
   prefetch_m((m6^1) * H_PRIME);
   int m7arr[130];
   q7_to_m7(m7arr, m6);
   // fixup for prefetching two iterations ahead
   m7arr[129] = m7arr[128] = m7arr[127];
   m7arr[13] = m7arr[15];
   m7arr[10] = m7arr[12];
   prefetch_m(m7arr[2]);
   for (q7 = 1; q7 < 128; ++q7) {
      if (q7 == 10 || q7 == 13) continue;
      prefetch_m(m7arr[q7+2]);
      m7 = m7arr[q7];
      ...
   }

Re: The 10**21 Problem (Part 3)
by oiskuu (Pilgrim) on May 16, 2014 at 12:35 UTC

    Some further analysis.

    Consider the last hashing step. Xor with q8 was a ±127 adjustment at most. Multiplied by hp=1000003, the difference is less than 2**27. It is therefore unlikely that a change in q8 would carry into sign of d8. Now, hp % 1001 == 4, which is like a shift. In a sense, all the magic bits have a weight of a power of two.

    > perl -e 'print join q( ), map 1000003*(1<<$_) % 1001 % 4, (0..7)'
    0 0 0 0 0 0 0 0
    What this means is q8 can be stepped without altering 2 lowest result bits, provided d8 does not change sign and d >= 4*step && d < 1001-4*step.

    Update: Actually, scratch that. I forgot q8 selects q9, which of course toggles the low bits. There might be a way to set further restrictions on allowed q8's, but this is getting pretty convoluted.

Node Type: perlmeditation [id://1085685]
Approved by lidden
Front-paged by Corion