
There is no point looping over q9. Instead, notice that as q9 takes all possible values from 0..127, m9 will take the values from (m8&~127)..(m8&~127)+127 in a different order. Among those 128 possible values of m9, there can only be one that has the right value modulo 1001. You can compute that value directly, and try the corresponding single value for q9, checking if it lies in the range.

-- ambrus' first theorem

Part I of this new series of articles on high performance computing ended with a teaser:

Can we eliminate the innermost loop somehow?

Our resident Perl Monks mathematician, ambrus, answered this question with an emphatic yes, with what I cannot restrain myself from dubbing the first theorem of ambrus. Not satisfied with that, he composed a second theorem:

...it could use any modulus between 1001 and 9999. To search for the magic formula in that case, first compute t = gcd(m9 - 1000, d9 - 500) (be careful with underflows). Then compute the small prime factors of t, and using this, all divisors of t between 1001 and 9999, and check all those divisors as the modulus against the rest of the numbers. In the rare lucky case that t is 0, you will have to try all moduluses (or all divisors of c9 - 100).

-- ambrus' second theorem
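For the curious, here is a rough, hypothetical C sketch of the second theorem's recipe. The variable names m9 and d9 and the command-line interface are my own; and rather than factoring t into small primes as ambrus suggests, this sketch simply trial-divides t by every candidate modulus in 1001..9999, which is equivalent and cheap at this scale:

#include <stdio.h>
#include <stdlib.h>

/* Euclid's algorithm; tolerates negative arguments (the caller
   normalizes the sign with labs). */
static long gcd(long a, long b)
{
    while (b != 0) {
        long r = a % b;
        a = b;
        b = r;
    }
    return a;
}

int main(int argc, char* argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s m9 d9\n", argv[0]);
        return 1;
    }
    /* long arithmetic avoids the underflow the theorem warns about */
    long m9 = atol(argv[1]);
    long d9 = atol(argv[2]);
    long t  = labs(gcd(m9 - 1000, d9 - 500));

    if (t == 0) {
        puts("t == 0: try all moduli (or all divisors of c9 - 100)");
        return 0;
    }
    for (long q = 1001; q <= 9999; ++q)
        if (t % q == 0)
            printf("candidate modulus: %ld\n", q);
    return 0;
}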

Given my maths is weak, I am grateful to ambrus for his mathematical advice and wish him the best in his career as a mathematician, hoping one day he will be remembered with something as famous as ambrus' last theorem. :-)

The first theorem of ambrus

Compared to the beautiful flower that is ambrus' first theorem, my implementation of it (inner2.c) looks like an ugly weed:

#define H_PRIME  1000003
#define HASH_LEN 11

// my_m128_t mimics Intel __m128i type
typedef struct my_m128_t my_m128_t;
struct my_m128_t { short m128i_i16[8]; };

int inner(
    int startval,
    int endval,
    int m2, int d2, int c2, int l2, int x2, int v2, int i2,
    my_m128_t* ps)
{
    int m3, m4, m5, m6, m7, m9;
    int d3, d4, d5, d6, d7, d8, d9;
    int c3, c4, c5, c6, c7, c9;
    int l3, l4, l5, l6, l7, l9;
    int x3, x4, x5, x6, x7, x9;
    int q3, q4, q5, q6, q7, q8, q9;
    int dist;
    int iret = 0;

    for (q3 = startval; q3 < endval; ++q3) {
        if (q3 == 10 || q3 == 13) continue;
        m3 = (m2 ^ q3) * H_PRIME;
        d3 = (d2 ^ q3) * H_PRIME;
        c3 = (c2 ^ q3) * H_PRIME;
        l3 = (l2 ^ q3) * H_PRIME;
        x3 = (x2 ^ q3) * H_PRIME;
        for (q4 = 1; q4 < 128; ++q4) {
            if (q4 == 10 || q4 == 13) continue;
            m4 = (m3 ^ q4) * H_PRIME;
            d4 = (d3 ^ q4) * H_PRIME;
            c4 = (c3 ^ q4) * H_PRIME;
            l4 = (l3 ^ q4) * H_PRIME;
            x4 = (x3 ^ q4) * H_PRIME;
            for (q5 = 1; q5 < 128; ++q5) {
                if (q5 == 10 || q5 == 13) continue;
                m5 = (m4 ^ q5) * H_PRIME;
                d5 = (d4 ^ q5) * H_PRIME;
                c5 = (c4 ^ q5) * H_PRIME;
                l5 = (l4 ^ q5) * H_PRIME;
                x5 = (x4 ^ q5) * H_PRIME;
                for (q6 = 1; q6 < 128; ++q6) {
                    if (q6 == 10 || q6 == 13) continue;
                    m6 = (m5 ^ q6) * H_PRIME;
                    d6 = (d5 ^ q6) * H_PRIME;
                    c6 = (c5 ^ q6) * H_PRIME;
                    l6 = (l5 ^ q6) * H_PRIME;
                    x6 = (x5 ^ q6) * H_PRIME;
                    for (q7 = 1; q7 < 128; ++q7) {
                        if (q7 == 10 || q7 == 13) continue;
                        m7 = (m6 ^ q7) * H_PRIME;
                        d7 = (d6 ^ q7) * H_PRIME;
                        c7 = (c6 ^ q7) * H_PRIME;
                        l7 = (l6 ^ q7) * H_PRIME;
                        x7 = (x6 ^ q7) * H_PRIME;
                        for (q8 = 1; q8 < 128; ++q8) {
                            if (q8 == 10 || q8 == 13) continue;
                            d8 = (d7 ^ q8) * H_PRIME ^ HASH_LEN;
                            d9 = d8 & ~127;
                            d9 %= 1001;
                            if (d9 < 0) d9 += 1001;
                            if (d9 < 500 - 127) continue;
                            if (d9 > 500 + 127) continue;
                            dist = d9 < 500 ? 500 - d9 : d9 - 500;
                            q9 = (d8 & 127) ^ dist;
                            if (q9 == 0 || q9 == 10 || q9 == 13) continue;
                            c9 = (c7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                            c9 %= 1001;
                            if (c9 < 0) c9 += 1001;
                            if (c9 != 100) continue;
                            l9 = (l7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                            l9 %= 1001;
                            if (l9 < 0) l9 += 1001;
                            if (l9 != 50) continue;
                            x9 = (x7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                            x9 %= 1001;
                            if (x9 < 0) x9 += 1001;
                            if (x9 != 10) continue;
                            m9 = (m7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                            if (m9 == -1) m9 = -2;
                            m9 %= 1001;
                            if (m9 < 0) m9 += 1001;
                            if (m9 != 1000) continue;
                            ps[iret].m128i_i16[0] = q3;
                            ps[iret].m128i_i16[1] = q4;
                            ps[iret].m128i_i16[2] = q5;
                            ps[iret].m128i_i16[3] = q6;
                            ps[iret].m128i_i16[4] = q7;
                            ps[iret].m128i_i16[5] = q8;
                            ps[iret].m128i_i16[6] = q9;
                            ++iret;
                        }
                    }
                }
            }
        }
    }
    return iret;
}
Compared with inner1.c, discussed last time, you will notice the entire innermost q9 loop is now gone! The observant reader will have further noticed that I shortened and inlined the old py_mod function (see this stackoverflow question for more details on calculating the modulo of negative numbers). Oh, and I also only bothered to calculate the first five Roman numerals (M, D, C, L, X) because hitting all five is rare enough that you can easily check with a separate program whether the last two (V and I) also match.
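As an aside, here is a minimal standalone sketch of that Python-style modulo fixup (the sample value is mine): C's % operator may yield a negative remainder for a negative dividend, whereas Python always returns a remainder with the sign of the divisor, hence the conditional adjustment seen throughout the code above.

#include <stdio.h>

/* Python-style modulo: C99's % truncates toward zero, so a negative
   dividend gives a negative remainder; Python instead returns a value
   in 0..m-1 for positive m. The conditional add fixes the difference. */
static int py_mod(int n, int m)
{
    int r = n % m;
    if (r < 0) r += m;
    return r;
}

int main(void)
{
    printf("%d\n", -123456789 % 1001);        /* prints -456 (C semantics) */
    printf("%d\n", py_mod(-123456789, 1001)); /* prints 545 (Python semantics) */
    return 0;
}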

Since this inner2.c code is just plain ANSI C, you can try it out for yourself by combining it with the find1.cpp driver given in episode one. For example, using gcc on Linux:

gcc -O3 -c inner2.c
g++ -o find1 find1.cpp inner2.o
Or Visual C++ on Windows:
cl /W3 /MD /O2 find1.cpp inner2.c
An example run on my (Haswell 4770K-powered) home PC is shown below:
# find1 9 8 9 60 61 1 2
9: sv1=8 ev1=9 sv2=60 ev2=61 sv3=1 ev3=2: 9 8 60 1 94 6 51 38 78 100
(wall clock time:109 secs, cpu time:108.95 units)
where 9 is the value of the outermost loop, and [8,9), [60,61), and [1,2) are the semi-open ranges to search in the next three outer loops, with all other inner loops searched exhaustively.

From that short run, we can estimate the total required running time as:

109 * 125**4 = 26,611,328,125 seconds
26,611,328,125 / (60 * 60 * 24 * 365) = 844 years
The 125 in the formula above is used because we search the values 1..9, 11, 12, 14..127, a total of 125 distinct values.

So, removing the innermost loop, via the first theorem of ambrus, reduced the estimated running time from 62,860 years down to 844 years, a 75-fold improvement! Yay!

Intel Intrinsics

Intel intrinsic instructions are C style functions that provide access to many Intel instructions -- including SSE, AVX, and more -- without the need to write assembly code.

-- Intel Intrinsics Guide

If necessary, and as a last resort, I was prepared to go to assembler for the inner loop. After all, that is why I restructured the code in the first place. In my reading, I stumbled across Intel intrinsics, which seemed a more convenient way to explore Intel-specific instructions without needing to write any assembler. In particular, the repetitive blocks like:

m4 = (m3 ^ q4) * H_PRIME;
d4 = (d3 ^ q4) * H_PRIME;
c4 = (c3 ^ q4) * H_PRIME;
l4 = (l3 ^ q4) * H_PRIME;
x4 = (x3 ^ q4) * H_PRIME;
seemed ripe for vectorization. So, to gain some experience with using these new (to me) compiler intrinsics, I adjusted the above inner2.c code to use Intel AVX instructions, as shown below:
#include <emmintrin.h>
#include <smmintrin.h>

#define H_PRIME  1000003
#define HASH_LEN 11

// my_m128_t mimics Intel __m128i type
typedef struct my_m128_t my_m128_t;
struct my_m128_t { short m128i_i16[8]; };

int inner(
    int startval,
    int endval,
    int m2, int d2, int c2, int l2, int x2, int v2, int i2,
    my_m128_t* ps)
{
    __m128i s2 = _mm_set_epi32(x2, l2, c2, d2);
    __m128i hp = _mm_set1_epi32(H_PRIME);
    __m128i s3, s4, s5, s6, s7;
    int m3, m4, m5, m6, m7, m9;
    int d8, d9;
    int c9;
    int l9;
    int x9;
    int q3, q4, q5, q6, q7, q8, q9;
    int dist;
    int iret = 0;

    for (q3 = startval; q3 < endval; ++q3) {
        if (q3 == 10 || q3 == 13) continue;
        m3 = (m2 ^ q3) * H_PRIME;
        s3 = _mm_mullo_epi32(_mm_xor_si128(s2, _mm_set1_epi32(q3)), hp);
        for (q4 = 1; q4 < 128; ++q4) {
            if (q4 == 10 || q4 == 13) continue;
            m4 = (m3 ^ q4) * H_PRIME;
            s4 = _mm_mullo_epi32(_mm_xor_si128(s3, _mm_set1_epi32(q4)), hp);
            for (q5 = 1; q5 < 128; ++q5) {
                if (q5 == 10 || q5 == 13) continue;
                m5 = (m4 ^ q5) * H_PRIME;
                s5 = _mm_mullo_epi32(_mm_xor_si128(s4, _mm_set1_epi32(q5)), hp);
                for (q6 = 1; q6 < 128; ++q6) {
                    if (q6 == 10 || q6 == 13) continue;
                    m6 = (m5 ^ q6) * H_PRIME;
                    s6 = _mm_mullo_epi32(_mm_xor_si128(s5, _mm_set1_epi32(q6)), hp);
                    for (q7 = 1; q7 < 128; ++q7) {
                        if (q7 == 10 || q7 == 13) continue;
                        m7 = (m6 ^ q7) * H_PRIME;
                        s7 = _mm_mullo_epi32(_mm_xor_si128(s6, _mm_set1_epi32(q7)), hp);
                        for (q8 = 1; q8 < 128; ++q8) {
                            if (q8 == 10 || q8 == 13) continue;
                            d8 = (s7.m128i_i32[0] ^ q8) * H_PRIME ^ HASH_LEN;
                            d9 = d8 & ~127;
                            d9 %= 1001;
                            if (d9 < 0) d9 += 1001;
                            if (d9 < 500 - 127) continue;
                            if (d9 > 500 + 127) continue;
                            dist = d9 < 500 ? 500 - d9 : d9 - 500;
                            q9 = (d8 & 127) ^ dist;
                            if (q9 == 0 || q9 == 10 || q9 == 13) continue;
                            c9 = (s7.m128i_i32[1] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                            c9 %= 1001;
                            if (c9 < 0) c9 += 1001;
                            if (c9 != 100) continue;
                            l9 = (s7.m128i_i32[2] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                            l9 %= 1001;
                            if (l9 < 0) l9 += 1001;
                            if (l9 != 50) continue;
                            x9 = (s7.m128i_i32[3] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                            x9 %= 1001;
                            if (x9 < 0) x9 += 1001;
                            if (x9 != 10) continue;
                            m9 = (m7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                            if (m9 == -1) m9 = -2;
                            m9 %= 1001;
                            if (m9 < 0) m9 += 1001;
                            if (m9 != 1000) continue;
                            ps[iret].m128i_i16[0] = q3;
                            ps[iret].m128i_i16[1] = q4;
                            ps[iret].m128i_i16[2] = q5;
                            ps[iret].m128i_i16[3] = q6;
                            ps[iret].m128i_i16[4] = q7;
                            ps[iret].m128i_i16[5] = q8;
                            ps[iret].m128i_i16[6] = q9;
                            ++iret;
                        }
                    }
                }
            }
        }
    }
    return iret;
}

The compiler intrinsics applied above are _mm_set1_epi32, _mm_xor_si128, and _mm_mullo_epi32 (note that the .m128i_i32 member access used to pull individual lanes out of an XMM register is MSVC-specific). Examination of the generated assembler showed that these intrinsics replaced four separate xor and four separate imul instructions on 32-bit registers with a single vpxor and a single vpmulld instruction, each operating simultaneously on four 32-bit values in a 128-bit XMM register, aka "Single Instruction Multiple Data" (SIMD).
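To make the substitution concrete, here is a small self-contained sketch (the variable names and sample values are mine) showing the four scalar hash steps and their single vector equivalent producing identical results. _mm_mullo_epi32 requires SSE4.1, so compile with, for example, gcc -msse4.1; the sketch uses a portable _mm_storeu_si128 store rather than the MSVC-only member access seen above.

#include <stdio.h>
#include <smmintrin.h>  /* SSE4.1: _mm_mullo_epi32 */

#define H_PRIME 1000003

int main(void)
{
    int m = 11, d = 22, c = 33, l = 44, q = 55;

    /* Scalar version: four xors and four multiplies */
    int ms = (m ^ q) * H_PRIME;
    int ds = (d ^ q) * H_PRIME;
    int cs = (c ^ q) * H_PRIME;
    int ls = (l ^ q) * H_PRIME;

    /* Vector version: one xor and one multiply on a 128-bit XMM
       register; the multiply keeps the low 32 bits of each lane,
       matching the wrap-around behaviour of scalar imul */
    __m128i s = _mm_set_epi32(l, c, d, m);   /* lane 0 holds m */
    __m128i r = _mm_mullo_epi32(_mm_xor_si128(s, _mm_set1_epi32(q)),
                                _mm_set1_epi32(H_PRIME));
    int out[4];
    _mm_storeu_si128((__m128i*)out, r);

    printf("scalar: %d %d %d %d\n", ms, ds, cs, ls);
    printf("vector: %d %d %d %d\n", out[0], out[1], out[2], out[3]);
    return 0;
}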

Curiously, this AVX-enabled program ran about eight seconds slower than the original, while the same AVX enabling of the program described in the next section ran about eight seconds faster, presumably due to the very different inner loops. In any case, the above vectorization is unlikely to make a big difference because it is applied only in the outer loops, not the crucial inner loop. In the next episode, we will meet an Intel instruction (prefetch) that has a much bigger performance impact.

Memoization

Memoization is an optimization technique used primarily to speed up computer programs by keeping the results of expensive function calls and returning the cached result when the same inputs occur again.

-- Memoization (wikipedia)

Being more computer programmer than mathematician, I eliminated the inner loop via a different method than ambrus: memoization.

After staring for hours at that damned inner loop, it occurred to me that there are "only" 2**32 (about four billion) distinct 32-bit keys, each mapping to a value in the range 0..127, for each Roman letter! This 4GB table size per letter depends on Python's hash function generating 32-bit values, not 64-bit ones (update: see the later 32 vs 64 bit Python hash function analysis).

Assuming 32-bit hash values, we can pre-calculate all the answers in a separate program (see the update at the end of this article for a link), stuffing them into five separate 4GB tables, one for each of M, D, C, L, and X. Nowadays, many home PCs (including mine) have 32 GB of memory and a 64-bit OS, so this 20 GB memory requirement is not really extravagant any more.
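The actual table-generating program is in the reply linked from the update below; the following is just a hypothetical brute-force sketch of the idea, with the target residues (1000 for M, 500 for D, and so on) taken from the inner loops above. For each 32-bit key it stores the q9 in 1..127 that makes the final hash step hit the target residue, or 0 for "no hit". A real generator would exploit ambrus' first theorem to find q9 directly rather than trying all 125 candidates. (The -1 adjustment strictly belongs only to the M path in the code above, but applying it uniformly is harmless for the other letters.)

#include <stdint.h>
#include <stdlib.h>

#define H_PRIME  1000003
#define HASH_LEN 11

/* Build one 4GB lookup table: table[k] = the q9 in 1..127 for which
   ((k * H_PRIME) ^ HASH_LEN ^ q9) mod 1001 equals target, else 0.
   Requires a 64-bit build to malloc 4GB. */
static unsigned char* build_table(int target)
{
    unsigned char* t = malloc(1ULL << 32);
    if (t == NULL) return NULL;
    for (uint64_t k = 0; k <= 0xFFFFFFFFULL; ++k) {
        int h = (int)((uint32_t)k * H_PRIME) ^ HASH_LEN;
        t[k] = 0;
        for (int q9 = 1; q9 < 128; ++q9) {
            if (q9 == 10 || q9 == 13) continue;  /* skip \n and \r */
            int v = h ^ q9;
            if (v == -1) v = -2;  /* CPython's hash() never returns -1 */
            v %= 1001;
            if (v < 0) v += 1001;
            if (v == target) { t[k] = (unsigned char)q9; break; }
        }
    }
    return t;
}

/* usage sketch:
   bytevecM = build_table(1000); bytevecD = build_table(500);
   bytevecC = build_table(100);  bytevecL = build_table(50);
   bytevecX = build_table(10); */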

To contrast the two approaches, here is "the first theorem of ambrus" inner loop:

d8 = (d7 ^ q8) * H_PRIME ^ HASH_LEN;
d9 = d8 & ~127;
d9 %= 1001;
if (d9 < 0) d9 += 1001;
if (d9 < 500 - 127) continue;
if (d9 > 500 + 127) continue;
dist = d9 < 500 ? 500 - d9 : d9 - 500;
q9 = (d8 & 127) ^ dist;
if (q9 == 0 || q9 == 10 || q9 == 13) continue;
c9 = (c7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
c9 %= 1001;
if (c9 < 0) c9 += 1001;
if (c9 != 100) continue;
l9 = (l7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
l9 %= 1001;
if (l9 < 0) l9 += 1001;
if (l9 != 50) continue;
x9 = (x7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
x9 %= 1001;
if (x9 < 0) x9 += 1001;
if (x9 != 10) continue;
m9 = (m7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
if (m9 == -1) m9 = -2;
m9 %= 1001;
if (m9 < 0) m9 += 1001;
if (m9 != 1000) continue;
ps[iret].m128i_i16[0] = q3;
ps[iret].m128i_i16[1] = q4;
ps[iret].m128i_i16[2] = q5;
ps[iret].m128i_i16[3] = q6;
ps[iret].m128i_i16[4] = q7;
ps[iret].m128i_i16[5] = q8;
ps[iret].m128i_i16[6] = q9;
++iret;
while here is "the first memory hack of eyepopslikeamosquito" inner loop:
q9 = bytevecM[(unsigned int)(m7 ^ q8)];
if (q9 != 0
    && q9 == bytevecD[(unsigned int)(d7 ^ q8)]
    && q9 == bytevecC[(unsigned int)(c7 ^ q8)]
    && q9 == bytevecL[(unsigned int)(l7 ^ q8)]
    && q9 == bytevecX[(unsigned int)(x7 ^ q8)])
{
    ps[iret].m128i_i16[0] = q3;
    ps[iret].m128i_i16[1] = q4;
    ps[iret].m128i_i16[2] = q5;
    ps[iret].m128i_i16[3] = q6;
    ps[iret].m128i_i16[4] = q7;
    ps[iret].m128i_i16[5] = q8;
    ps[iret].m128i_i16[6] = q9;
    ++iret;
}
where bytevecM, bytevecD, bytevecC, bytevecL, and bytevecX are byte arrays of 4GB in size each (with a value of zero used to indicate "no hit").

Now, comparing the two sets of inner loop code, the memory hack has to be a lot faster, right? After all, it replaces quite a bit of calculation with just two table lookups (in the common case). I remembered, from years ago, that ctype.h is typically implemented via a 256-byte lookup table, presumably because it is fast. Table lookups are fast. Or so I thought.
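For illustration, here is a minimal sketch (my own, not from any actual ctype.h implementation) of that classic 256-byte table technique: one array index replaces a chain of comparisons, and the whole table fits in a few cache lines.

#include <stdio.h>

/* A ctype.h-style classifier: the table fits in four 64-byte cache
   lines, so lookups are effectively free once it is warm in L1. */
static unsigned char is_digit_tbl[256];

int main(void)
{
    for (int ch = '0'; ch <= '9'; ++ch)
        is_digit_tbl[ch] = 1;
    printf("%d %d\n", is_digit_tbl['7'], is_digit_tbl['x']);  /* prints: 1 0 */
    return 0;
}

The catch, as I was about to discover, is that this is only fast because the table is tiny: scale it up to five 4GB tables and the picture changes completely.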

Enough theory, let's try it:

# find3a 9 8 9 60 61 1 2
9: sv1=8 ev1=9 sv2=60 ev2=61 sv3=1 ev3=2: 9 8 60 1 94 6 51 38 78 100
(wall clock time:94 secs, cpu time:93.90 units)
Hmmm, 94 seconds is only a modest improvement over 109. What's going on?

To find out, I profiled the code with Intel VTune ... and was horrified to learn that the processor was spending nearly all its time sitting around doing nothing, just waiting for memory! Why?

Modern Memory Architecture

As CPU cores become both faster and more numerous, the limiting factor for most programs is now, and will be for some time, memory access. ... Unfortunately, neither the structure nor the cost of using the memory subsystem of a computer or the caches on CPUs is well understood by most programmers.

-- Ulrich Drepper

People who are more than casually interested in computers should have at least some idea of what the underlying hardware is like.

-- Donald Knuth

Like many computer programmers, I had a mediocre understanding of how memory hardware works on a modern PC. By the way, the best reference I have found so far on this topic is What Every Programmer Should Know About Memory (pdf) by Ulrich Drepper. If you know of a better one, please let us know.

Despite considerable study, my understanding of modern memory architectures has only improved from mediocre to poor. You see, as CPU speeds have kept relentlessly increasing faster than memory speeds, memory architectures have grown more and more complicated, adding more and more cache levels and layers, to the point of melting my brain. The typical modern PC has at least:

  • An instruction cache to speed up instruction fetch.
  • A data cache to speed up data fetch and store. Nowadays, this is usually a three-level cache: L1, L2, and L3.
  • A translation lookaside buffer (TLB) to speed up virtual-to-physical address translation. This cache is also typically multi-level.

Note that data is transferred between memory and cache in blocks of fixed size, called cache lines, typically 64 bytes each. All loads and stores go through the cache: if the cache line containing the requested byte is not already present, it is fetched from main memory, evicting an existing cache line to make room.

Since modern CPUs can execute hundreds of instructions in the time taken to fetch a single cache line from main memory, we need to organize things so that the CPU has something to do while waiting for the cache line to load. Otherwise, we have a CPU stall. And that is what VTune was telling me I had. It further warned me of a high rate of TLB cache misses.

After this basic memory research, it dawned on me that I had a classic locality of reference problem.

Because each memory access into my (huge) 4GB lookup tables was essentially random, most memory accesses missed the L1 cache, missed the L2 cache, missed the L3 cache, then waited for the cache line to be loaded into all three caches from main memory, while often incurring a TLB cache miss, just to rub salt into the wounds. Again and again and again. Aargh! What to do?
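You can see the effect for yourself with a rough micro-benchmark along the following lines (my own sketch; exact timings vary wildly by machine). It sums the same 256MB of bytes twice, once sequentially and once in a pseudo-random order; on typical hardware the random walk is many times slower, because nearly every access misses every cache level.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 28)   /* 256MB: much larger than any CPU cache */

int main(void)
{
    unsigned char* buf = malloc(N);
    if (buf == NULL) return 1;
    for (unsigned i = 0; i < N; ++i) buf[i] = (unsigned char)i;

    unsigned long sum = 0;
    clock_t t0 = clock();
    for (unsigned i = 0; i < N; ++i)
        sum += buf[i];                        /* sequential: cache friendly */
    clock_t t1 = clock();

    unsigned idx = 1;
    for (unsigned i = 0; i < N; ++i) {
        idx = idx * 1664525u + 1013904223u;   /* simple LCG */
        sum += buf[idx & (N - 1)];            /* random: mostly cache misses */
    }
    clock_t t2 = clock();

    printf("sequential: %.2fs  random: %.2fs  (checksum %lu)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    free(buf);
    return 0;
}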

Well, an obvious fix is to find an algorithm that accesses the 4GB lookup tables sequentially. Unable to find such an algorithm, I resorted to yet more hacks, described in the next installment.

By the way, the mind-boggling cost of cache misses with modern CPUs has finally overturned the conventional C++ wisdom that linked lists should be preferred to vectors when performing many insertions and deletions in the middle of the list. The new wisdom is simply: just use vector. :-) That is, vector's superior data locality trumps linked list in just about all known scenarios.

Two Oxen or 1024 Chickens?

If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?

-- Seymour Cray

The nice thing about the "first theorem of ambrus" code is that it is just plain, simple C code, with modest memory requirements, that can run anywhere. So it is very suitable for massive parallelization via CUDA/OpenCL/GPGPU/FPGA, for example. That is, this approach is well suited to employing 1024 chickens.

With its huge memory requirements, use of Intel intrinsics, plus OS support, the approach I took to solving this problem was essentially that of harnessing two strong oxen. I took that approach only due to the steep learning curve of CUDA/OpenCL/GPGPU/FPGA, none of which I have ever used. I would like to find the time in the future to pursue the 1024 chickens approach with a CUDA/OpenCL/GPGPU/FPGA implementation of the second theorem of ambrus.


Updated: Minor improvements to wording. In the "Memoization" section, added link to reply below that contains a program to generate the lookup table.

