There is no point looping over q9. Instead, notice that as q9 takes all possible values from 0..127, m9 will take the values from (m8&~127)..(m8&~127)+127 in a different order. Among those 127 possible values of m9, there can only be one that has the right value modulo 1001. You can compute that value directly, and try the corresponding single value for q9, checking if it lies in the range.
Part I of this new series of articles on high performance computing ended with a teaser:
Can we eliminate the innermost loop somehow?
Our resident Perl Monks mathematician, ambrus, answered this question with an emphatic yes, with what I cannot restrain myself from dubbing the first theorem of ambrus. Not satisfied with that, he composed a second theorem:
...it could use any modulus between 1001 and 9999. To search for the magic formula in that case, first compute t = gcd(m9  1000, d9  500) (be careful with underflows). Then compute the small prime factors of t, and using this, all divisors of t between 1001 and 9999, and check all those divisors as the modulus against the rest of the numbers. In the rare lucky case that t is 0, you will have to try all moduluses (or all divisors of c9  100).
Given my maths is weak, I am grateful to ambrus for his mathematical advice and wish him the best in his career as a mathematician, hoping one day he will be remembered with something as famous as ambrus' last theorem. :)
The first theorem of ambrus
Compared to the beautiful flower that is ambrus' first theorem, my implementation of it (inner2.c) looks like an ugly weed:
Comparing to inner1.c, discussed last time, you will notice the entire innermost q9 loop is now gone! The observant reader will have further noticed that I shortened and inlined the old py_mod function (see this stackoverflow question for more details on calculating modulos that contain negative numbers). Oh, and I also only bothered to calculate the first five Roman numbers (M, D, C, L, X) because hitting these five is rare enough that you can easily check with a separate program whether the last two (V and I) match.#define H_PRIME 1000003 #define HASH_LEN 11 // my_m128_t mimics Intel __m128i type typedef struct my_m128_t my_m128_t; struct my_m128_t { short m128i_i16[8]; }; int inner( int startval, int endval, int m2, int d2, int c2, int l2, int x2, int v2, int i2, my_m128_t* ps) { int m3, m4, m5, m6, m7, m9; int d3, d4, d5, d6, d7, d8, d9; int c3, c4, c5, c6, c7, c9; int l3, l4, l5, l6, l7, l9; int x3, x4, x5, x6, x7, x9; int q3, q4, q5, q6, q7, q8, q9; int dist; int iret = 0; for (q3 = startval; q3 < endval; ++q3) { if (q3 == 10  q3 == 13) continue; m3 = (m2 ^ q3) * H_PRIME; d3 = (d2 ^ q3) * H_PRIME; c3 = (c2 ^ q3) * H_PRIME; l3 = (l2 ^ q3) * H_PRIME; x3 = (x2 ^ q3) * H_PRIME; for (q4 = 1; q4 < 128; ++q4) { if (q4 == 10  q4 == 13) continue; m4 = (m3 ^ q4) * H_PRIME; d4 = (d3 ^ q4) * H_PRIME; c4 = (c3 ^ q4) * H_PRIME; l4 = (l3 ^ q4) * H_PRIME; x4 = (x3 ^ q4) * H_PRIME; for (q5 = 1; q5 < 128; ++q5) { if (q5 == 10  q5 == 13) continue; m5 = (m4 ^ q5) * H_PRIME; d5 = (d4 ^ q5) * H_PRIME; c5 = (c4 ^ q5) * H_PRIME; l5 = (l4 ^ q5) * H_PRIME; x5 = (x4 ^ q5) * H_PRIME; for (q6 = 1; q6 < 128; ++q6) { if (q6 == 10  q6 == 13) continue; m6 = (m5 ^ q6) * H_PRIME; d6 = (d5 ^ q6) * H_PRIME; c6 = (c5 ^ q6) * H_PRIME; l6 = (l5 ^ q6) * H_PRIME; x6 = (x5 ^ q6) * H_PRIME; for (q7 = 1; q7 < 128; ++q7) { if (q7 == 10  q7 == 13) continue; m7 = (m6 ^ q7) * H_PRIME; d7 = (d6 ^ q7) * H_PRIME; c7 = (c6 ^ q7) * H_PRIME; l7 = (l6 ^ q7) * H_PRIME; x7 = (x6 ^ q7) * H_PRIME; for (q8 = 1; q8 < 128; ++q8) { if (q8 == 10  q8 == 13) continue; d8 = (d7 ^ q8) * H_PRIME ^ HASH_LEN; d9 = d8 & ~127; d9 %= 1001; if (d9 < 0) d9 += 1001; if (d9 < 500  127) continue; if (d9 > 500 + 127) continue; dist = d9 < 500 ? 500  d9 : d9  500; q9 = (d8 & 127) ^ dist; if (q9 == 0  q9 == 10  q9 == 13) continue; c9 = (c7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9; c9 %= 1001; if (c9 < 0) c9 += 1001; if (c9 != 100) continue; l9 = (l7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9; l9 %= 1001; if (l9 < 0) l9 += 1001; if (l9 != 50) continue; x9 = (x7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9; x9 %= 1001; if (x9 < 0) x9 += 1001; if (x9 != 10) continue; m9 = (m7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9; if (m9 == 1) m9 = 2; m9 %= 1001; if (m9 < 0) m9 += 1001; if (m9 != 1000) continue; ps[iret].m128i_i16[0] = q3; ps[iret].m128i_i16[1] = q4; ps[iret].m128i_i16[2] = q5; ps[iret].m128i_i16[3] = q6; ps[iret].m128i_i16[4] = q7; ps[iret].m128i_i16[5] = q8; ps[iret].m128i_i16[6] = q9; ++iret; } } } } } } return iret; }
Since this inner2.c code is just plain ANSI C, you can try it out for yourself by combining with find1.cpp given in episode one. For example, using gcc on Linux:
Or Visual C++ on Windows:gcc O3 c inner2.c g++ o find1 find1.cpp inner2.o
An example run on my (Haswell 4770Kpowered) home PC is shown below:cl /W3 /MD /O2 find1.cpp inner2.c
where 9 is the value of the outermost loop, and [8,9), [60,61), and [1,2) are the semi open ranges to search in the next three outer loops, with all other inner loops searched exhaustively.# find1 9 8 9 60 61 1 2 9: sv1=8 ev1=9 sv2=60 ev2=61 sv3=1 ev3=2: 9 8 60 1 94 6 51 38 78 100 (wall clock time:109 secs, cpu time:108.95 units)
From that short run, we can estimate the total required running time as:
The 125 in the formula above is used because we search the values 1..9, 11, 12, 14..127, a total of 125 distinct values.109 * 125**4 = 26,611,328,125 seconds = 26611328125 / 60 / 60 / 24 / 3 +65 = 844 years
So, removing the innermost loop, via the first theorem of ambrus, reduced the estimated running time from 62,860 years down to 844 years, a 75fold improvement! Yay!
Intel Instrinsics
Intel intrinsic instructions are C style functions that provide access to many Intel instructions  including SSE, AVX, and more  without the need to write assembly code.
If necessary, and as a last resort, I was prepared to go to assembler for the inner loop. After all, that is why I restructured the code in the first place. In my reading, I stumbled across Intel intrinsics, which seemed a more convenient way to explore Intelspecific instructions without needing to write any assembler. In particular, the repetitive blocks like:
seemed ripe for vectorization. So, to gain some experience with using these new (to me) compiler instrinsics, I adjusted the above inner2.c code to use Intel AVX instructions, as shown below:m4 = (m3 ^ q4) * H_PRIME; d4 = (d3 ^ q4) * H_PRIME; c4 = (c3 ^ q4) * H_PRIME; l4 = (l3 ^ q4) * H_PRIME; x4 = (x3 ^ q4) * H_PRIME;
#include <emmintrin.h> #include <smmintrin.h> #define H_PRIME 1000003 #define HASH_LEN 11 // my_m128_t mimics Intel __m128i type typedef struct my_m128_t my_m128_t; struct my_m128_t { short m128i_i16[8]; }; int inner( int startval, int endval, int m2, int d2, int c2, int l2, int x2, int v2, int i2, my_m128_t* ps) { __m128i s2 = _mm_set_epi32(x2, l2, c2, d2); __m128i hp = _mm_set1_epi32(H_PRIME); __m128i s3, s4, s5, s6, s7; int m3, m4, m5, m6, m7, m9; int d8, d9; int c9; int l9; int x9; int q3, q4, q5, q6, q7, q8, q9; int dist; int iret = 0; for (q3 = startval; q3 < endval; ++q3) { if (q3 == 10  q3 == 13) continue; m3 = (m2 ^ q3) * H_PRIME; s3 = _mm_mullo_epi32(_mm_xor_si128(s2, _mm_set1_epi32(q3)), hp); for (q4 = 1; q4 < 128; ++q4) { if (q4 == 10  q4 == 13) continue; m4 = (m3 ^ q4) * H_PRIME; s4 = _mm_mullo_epi32(_mm_xor_si128(s3, _mm_set1_epi32(q4)), hp); for (q5 = 1; q5 < 128; ++q5) { if (q5 == 10  q5 == 13) continue; m5 = (m4 ^ q5) * H_PRIME; s5 = _mm_mullo_epi32(_mm_xor_si128(s4, _mm_set1_epi32(q5)), hp +); for (q6 = 1; q6 < 128; ++q6) { if (q6 == 10  q6 == 13) continue; m6 = (m5 ^ q6) * H_PRIME; s6 = _mm_mullo_epi32(_mm_xor_si128(s5, _mm_set1_epi32(q6)), +hp); for (q7 = 1; q7 < 128; ++q7) { if (q7 == 10  q7 == 13) continue; m7 = (m6 ^ q7) * H_PRIME; s7 = _mm_mullo_epi32(_mm_xor_si128(s6, _mm_set1_epi32(q7)) +, hp); for (q8 = 1; q8 < 128; ++q8) { if (q8 == 10  q8 == 13) continue; d8 = (s7.m128i_i32[0] ^ q8) * H_PRIME ^ HASH_LEN; d9 = d8 & ~127; d9 %= 1001; if (d9 < 0) d9 += 1001; if (d9 < 500  127) continue; if (d9 > 500 + 127) continue; dist = d9 < 500 ? 500  d9 : d9  500; q9 = (d8 & 127) ^ dist; if (q9 == 0  q9 == 10  q9 == 13) continue; c9 = (s7.m128i_i32[1] ^ q8) * H_PRIME ^ HASH_LEN ^ q9; c9 %= 1001; if (c9 < 0) c9 += 1001; if (c9 != 100) continue; l9 = (s7.m128i_i32[2] ^ q8) * H_PRIME ^ HASH_LEN ^ q9; l9 %= 1001; if (l9 < 0) l9 += 1001; if (l9 != 50) continue; x9 = (s7.m128i_i32[3] ^ q8) * H_PRIME ^ HASH_LEN ^ q9; x9 %= 1001; if (x9 < 0) x9 += 1001; if (x9 != 10) continue; m9 = (m7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9; if (m9 == 1) m9 = 2; m9 %= 1001; if (m9 < 0) m9 += 1001; if (m9 != 1000) continue; ps[iret].m128i_i16[0] = q3; ps[iret].m128i_i16[1] = q4; ps[iret].m128i_i16[2] = q5; ps[iret].m128i_i16[3] = q6; ps[iret].m128i_i16[4] = q7; ps[iret].m128i_i16[5] = q8; ps[iret].m128i_i16[6] = q9; ++iret; } } } } } } return iret; }
The MSVC AVX compiler intrinsics applied above are _mm_set1_epi32, _mm_xor_si128 and _mm_mullo_epi32. Examination of the generated assembler showed that application of these intrinsics replaced four separate xor and imul instructions on 32bit registers with a single vpxor and vpmulld instruction operating simultaneously on four 32bit values in a 128bit XMM register, aka "Single Instruction Multiple Data" (SIMD).
Curiously, this AVX enabled program ran about eight seconds slower than the original, while the same AVX enabling of the program described in the next section ran about eight seconds faster, presumably due to the very different inner loops. In any case, the above vectorization is unlikely to make a big difference because it is used only in the outer loops, not the crucial inner loop. In the next episode, we will meet an Intel instruction (prefetch) that makes a bigger performance impact.
Memoization
Memoization is an optimization technique used primarily to speed up computer programs by keeping the results of expensive function calls and returning the cached result when the same inputs occur again
Being more computer programmer than mathematician, I eliminated the inner loop via a different method to ambrus: Memoization.
After staring for hours at that damned inner loop, it occurred to me that there are "only" 4GB distinct values (with each value in the range 0..127) for each Roman letter! This 4GB limit depends on Python's hash function generating 32bit values, not 64bit ones (update: see later 32 vs 64 bit Python hash function analysis).
Assuming only 4GB distinct values, we can precalculate them in a separate program (shown below), stuffing them into five separate 4GB tables, one for each of M, D, C, L, and X. Nowadays, many home PCs (including mine) have 32 GB of memory and a 64bit OS, so this 20 GB memory requirement is not really extravagant any more.
To constrast the two approaches, here is "the first theorem of ambrus" inner loop:
while here is "the first memory hack of eyepopslikeamosquito" inner loop:d8 = (d7 ^ q8) * H_PRIME ^ HASH_LEN; d9 = d8 & ~127; d9 %= 1001; if (d9 < 0) d9 += 1001; if (d9 < 500  127) continue; if (d9 > 500 + 127) continue; dist = d9 < 500 ? 500  d9 : d9  500; q9 = (d8 & 127) ^ dist; if (q9 == 0  q9 == 10  q9 == 13) continue; c9 = (c7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9; c9 %= 1001; if (c9 < 0) c9 += 1001; if (c9 != 100) continue; l9 = (l7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9; l9 %= 1001; if (l9 < 0) l9 += 1001; if (l9 != 50) continue; x9 = (x7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9; x9 %= 1001; if (x9 < 0) x9 += 1001; if (x9 != 10) continue; m9 = (m7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9; if (m9 == 1) m9 = 2; m9 %= 1001; if (m9 < 0) m9 += 1001; if (m9 != 1000) continue; ps[iret].m128i_i16[0] = q3; ps[iret].m128i_i16[1] = q4; ps[iret].m128i_i16[2] = q5; ps[iret].m128i_i16[3] = q6; ps[iret].m128i_i16[4] = q7; ps[iret].m128i_i16[5] = q8; ps[iret].m128i_i16[6] = q9; ++iret;
where bytevecM, bytevecD, bytevecC, bytevecL, and bytevecX are byte arrays of 4GB in size each (with a value of zero used to indicate "no hit").q9 = bytevecM[(unsigned int)(m7 ^ q8)]; if (q9 != 0 && q9 == bytevecD[(unsigned int)(d7 ^ q8)] && q9 == bytevecC[(unsigned int)(c7 ^ q8)] && q9 == bytevecL[(unsigned int)(l7 ^ q8)] && q9 == bytevecX[(unsigned int)(x7 ^ q8)]) { ps[iret].m128i_i16[0] = q3; ps[iret].m128i_i16[1] = q4; ps[iret].m128i_i16[2] = q5; ps[iret].m128i_i16[3] = q6; ps[iret].m128i_i16[4] = q7; ps[iret].m128i_i16[5] = q8; ps[iret].m128i_i16[6] = q9; ++iret; }
Now, comparing the two sets of inner loop code, the memory hack has to be a lot faster, right? After all, it replaces quite a bit of calculation with just two table lookups (in the common case). I remembered, from years ago, that ctype.h is typically implemented via a 256byte lookup table, presumably because it is fast. Table lookups are fast. Or so I thought.
Enough theory, let's try it:
Hmmm, 94 seconds is only a modest improvement over 109. What's going on?# find3a 9 8 9 60 61 1 2 9: sv1=8 ev1=9 sv2=60 ev2=61 sv3=1 ev3=2: 9 8 60 1 94 6 51 38 78 100 (wall clock time:94 secs, cpu time:93.90 units)
To find out, I profiled the code with Intel VTune ... and was horrified to learn that the processor was spending nearly all its time sitting around doing nothing, just waiting for memory! Why?
Modern Memory Architecture
As CPU cores become both faster and more numerous, the limiting factor for most programs is now, and will be for some time, memory access. ... Unfortunately, neither the structure nor the cost of using the memory subsystem of a computer or the caches on CPUs is well understood by most programmers.
People who are more than casually interested in computers should have at least some idea of what the underlying hardware is like.
 Donald Knuth
Like many computer programmers, I had a mediocre understanding of how memory hardware works on a modern PC. By the way, the best reference I have found so far on this topic is What Every Programmer Should Know About Memory (pdf) by Ulrich Drepper. If you know of a better one, please let us know.
Despite considerable study, my understanding of modern memory architectures has only improved from mediocre to poor. You see, as CPU speeds have kept relentlessly increasing faster than memory speeds, memory architectures have grown more and more complicated, adding more and more cache levels and layers, to the point of melting my brain. The typical modern PC has at least:
Since modern CPUs can execute hundreds of instructions in the time taken to fetch a single cache line from main memory, we need to organize things so that the CPU has something to do while waiting for the cache line to load. Otherwise, we have a CPU stall. And that is what VTune was telling me I had. It further warned me of a high rate of TLB cache misses.
After this basic memory research, it dawned on me that I had a classic locality of reference problem.
Because each memory access into my (huge) 4GB lookup tables was essentially random, most memory accesses missed the L1 cache, missed the L2 cache, missed the L3 cache, then waited for the cache line to be loaded into all three caches from main memory, while often incurring a TLB cache miss, just to rub salt into the wounds. Again and again and again. Aargh! What to do?
Well, an obvious fix is to find an algorithm that accesses the 4GB lookup tables sequentially. Unable to find such an algorithm, I resorted to yet more hacks, described in the next installment.
By the way, the mindboggling cost of cache misses with modern CPUs has finally overturned the conventional C++ wisdom that linked lists should be preferred to vectors when performing many insertions and deletions in the middle of the list. The new wisdom is simply: just use vector. :) That is, vector's superior data locality trumps linked list in just about all known scenarios.
Two Oxen or 1024 Chickens?
If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?
 Seymour Cray
The nice thing about the "first theorem of ambrus" code is that it is just plain, simple C code, with modest memory requirements, that can run anywhere. So it is very suitable for massive parallelization via OpenCL/ GPGPU/ FPGA, for example. That is, this approach is well suited to employing 1024 chickens.
With its huge memory requirements, use of Intel intrinsics, plus OS support, the approach I took to solving this problem was essentially that of harnessing two strong oxen. I took that approach only due to the steep learning curve of OpenCL/GPGPU/FPGA, which I have never used. I would like to find the time in the future to pursue the 1024 chickens approach with an OpenCL/GPGPU/FPGA implementation of the second theorem of ambrus.
References
Updated: Minor improvements to wording. In the "Memoization" section, added link to reply below that contains a program to generate the lookup table.


Replies are listed 'Best First'.  

Re: The 10**21 Problem (Part 2)
by zentara (Archbishop) on May 02, 2014 at 14:56 UTC  
by chacham (Prior) on May 05, 2014 at 18:28 UTC  
Re: The 10**21 Problem (Part 2)
by eyepopslikeamosquito (Bishop) on May 18, 2014 at 04:07 UTC  
by oiskuu (Hermit) on May 18, 2014 at 09:21 UTC 