There is no point looping over q9.
Instead, notice that as q9 takes all possible values from 0..127, m9 will take the values from (m8&~127)..(m8&~127)+127 in a different order.
Among those 128 possible values of m9, there can only be one that has the right value modulo 1001.
You can compute that value directly, and try the corresponding single value for q9, checking if it lies in the range.
-- ambrus' first theorem
Part I of this new series of articles on high performance computing ended with a teaser:
Can we eliminate the innermost loop somehow?
Our resident Perl Monks mathematician, ambrus, answered this question
with an emphatic yes,
with what I cannot restrain myself from dubbing the first theorem of ambrus.
Not satisfied with that, he composed a second theorem:
...it could use any modulus between 1001 and 9999.
To search for the magic formula in that case, first compute t = gcd(m9 - 1000, d9 - 500) (be careful with underflows).
Then compute the small prime factors of t, and using this, all divisors of t between 1001 and 9999, and check all
those divisors as the modulus against the rest of the numbers.
In the rare lucky case that t is 0, you will have to try all moduli (or all divisors of c9 - 100).
-- ambrus' second theorem
Given my maths is weak, I am grateful to ambrus for his mathematical advice
and wish him the best in his career as a mathematician, hoping one day he will be remembered
with something as famous as ambrus' last theorem. :-)
The first theorem of ambrus
Compared to the beautiful flower that is ambrus' first theorem, my
implementation of it (inner2.c) looks like an ugly weed:
#define H_PRIME 1000003
#define HASH_LEN 11
// my_m128_t mimics Intel __m128i type
typedef struct my_m128_t my_m128_t;
struct my_m128_t { short m128i_i16[8]; };
int inner(
  int startval, int endval,
  int m2, int d2, int c2, int l2, int x2, int v2, int i2,
  my_m128_t* ps)
{
  int m3, m4, m5, m6, m7, m9;
  int d3, d4, d5, d6, d7, d8, d9;
  int c3, c4, c5, c6, c7, c9;
  int l3, l4, l5, l6, l7, l9;
  int x3, x4, x5, x6, x7, x9;
  int q3, q4, q5, q6, q7, q8, q9;
  int dist;
  int iret = 0;
  for (q3 = startval; q3 < endval; ++q3) {
    if (q3 == 10 || q3 == 13) continue;
    m3 = (m2 ^ q3) * H_PRIME;
    d3 = (d2 ^ q3) * H_PRIME;
    c3 = (c2 ^ q3) * H_PRIME;
    l3 = (l2 ^ q3) * H_PRIME;
    x3 = (x2 ^ q3) * H_PRIME;
    for (q4 = 1; q4 < 128; ++q4) {
      if (q4 == 10 || q4 == 13) continue;
      m4 = (m3 ^ q4) * H_PRIME;
      d4 = (d3 ^ q4) * H_PRIME;
      c4 = (c3 ^ q4) * H_PRIME;
      l4 = (l3 ^ q4) * H_PRIME;
      x4 = (x3 ^ q4) * H_PRIME;
      for (q5 = 1; q5 < 128; ++q5) {
        if (q5 == 10 || q5 == 13) continue;
        m5 = (m4 ^ q5) * H_PRIME;
        d5 = (d4 ^ q5) * H_PRIME;
        c5 = (c4 ^ q5) * H_PRIME;
        l5 = (l4 ^ q5) * H_PRIME;
        x5 = (x4 ^ q5) * H_PRIME;
        for (q6 = 1; q6 < 128; ++q6) {
          if (q6 == 10 || q6 == 13) continue;
          m6 = (m5 ^ q6) * H_PRIME;
          d6 = (d5 ^ q6) * H_PRIME;
          c6 = (c5 ^ q6) * H_PRIME;
          l6 = (l5 ^ q6) * H_PRIME;
          x6 = (x5 ^ q6) * H_PRIME;
          for (q7 = 1; q7 < 128; ++q7) {
            if (q7 == 10 || q7 == 13) continue;
            m7 = (m6 ^ q7) * H_PRIME;
            d7 = (d6 ^ q7) * H_PRIME;
            c7 = (c6 ^ q7) * H_PRIME;
            l7 = (l6 ^ q7) * H_PRIME;
            x7 = (x6 ^ q7) * H_PRIME;
            for (q8 = 1; q8 < 128; ++q8) {
              if (q8 == 10 || q8 == 13) continue;
              d8 = (d7 ^ q8) * H_PRIME ^ HASH_LEN;
              d9 = d8 & ~127;
              d9 %= 1001; if (d9 < 0) d9 += 1001;
              if (d9 < 500 - 127) continue;
              if (d9 > 500 + 127) continue;
              dist = d9 < 500 ? 500 - d9 : d9 - 500;
              q9 = (d8 & 127) ^ dist;
              if (q9 == 0 || q9 == 10 || q9 == 13) continue;
              c9 = (c7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
              c9 %= 1001; if (c9 < 0) c9 += 1001;
              if (c9 != 100) continue;
              l9 = (l7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
              l9 %= 1001; if (l9 < 0) l9 += 1001;
              if (l9 != 50) continue;
              x9 = (x7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
              x9 %= 1001; if (x9 < 0) x9 += 1001;
              if (x9 != 10) continue;
              m9 = (m7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
              if (m9 == -1) m9 = -2;
              m9 %= 1001; if (m9 < 0) m9 += 1001;
              if (m9 != 1000) continue;
              ps[iret].m128i_i16[0] = q3;
              ps[iret].m128i_i16[1] = q4;
              ps[iret].m128i_i16[2] = q5;
              ps[iret].m128i_i16[3] = q6;
              ps[iret].m128i_i16[4] = q7;
              ps[iret].m128i_i16[5] = q8;
              ps[iret].m128i_i16[6] = q9;
              ++iret;
            }
          }
        }
      }
    }
  }
  return iret;
}
Comparing to
inner1.c, discussed
last time, you will notice
the entire innermost
q9 loop is now gone!
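To see why a 127-iteration loop can collapse to a single candidate, here is a tiny standalone check (a sketch of my own, not part of inner2.c) of ambrus' observation: the 128 values h ^ q9, for q9 in 0..127, all fall inside one aligned window of 128 consecutive integers, and a window that small can contain at most one number with any given remainder modulo 1001. That unique candidate, when it exists, is what the d8/dist/q9 lines above compute directly.
/* check_ambrus1.c - throwaway sketch verifying ambrus' first theorem:
 * for any 32-bit h, at most one q9 in 0..127 can make (h ^ q9) hit a
 * given target modulo 1001. */
#include <stdio.h>
static int count_hits(int h, int target)
{
  int q9, hits = 0;
  for (q9 = 0; q9 < 128; ++q9) {
    int v = (h ^ q9) % 1001;
    if (v < 0) v += 1001;          /* same negative-modulo fix as inner2.c */
    if (v == target) ++hits;
  }
  return hits;
}
int main(void)
{
  int h, worst = 0;
  for (h = -1000000; h < 1000000; ++h) {   /* spot-check two million values */
    int hits = count_hits(h, 500);         /* 500 is the D target */
    if (hits > worst) worst = hits;
  }
  printf("maximum number of matching q9 values seen: %d\n", worst);  /* expect 1 */
  return 0;
}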
The observant reader will have further noticed that I shortened
and inlined the old
py_mod function
(see
this stackoverflow question
for more details on calculating modulos that contain negative numbers).
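For the curious, the pattern looks like this. The py_mod function below is just a sketch of the old helper to show the idea; only the inlined two-statement form actually appears in inner2.c. The issue is that C's % operator can return a negative result when its left operand is negative, while the Python hash test needs the non-negative remainder:
/* sketch of the old py_mod helper (Python-style non-negative modulo) */
static int py_mod(int n, int m)
{
  int r = n % m;        /* in C, r takes the sign of n, so it may be negative */
  if (r < 0) r += m;    /* shift into the range 0 .. m-1 */
  return r;
}
which inner2.c now inlines as, for example:
c9 %= 1001; if (c9 < 0) c9 += 1001;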
Oh, and I also only bothered to calculate the first five Roman numbers (M, D, C, L, X)
because hitting these five is rare enough that you can easily check with a separate
program whether the last two (V and I) match.
Since this inner2.c code is just plain ANSI C, you can try it out for yourself
by combining with find1.cpp given in episode one.
For example, using gcc on Linux:
gcc -O3 -c inner2.c
g++ -o find1 find1.cpp inner2.o
Or Visual C++ on Windows:
cl /W3 /MD /O2 find1.cpp inner2.c
An example run on my (
Haswell 4770K-powered) home PC
is shown below:
# find1 9 8 9 60 61 1 2
9: sv1=8 ev1=9 sv2=60 ev2=61 sv3=1 ev3=2:
9 8 60 1 94 6 51 38 78 100
(wall clock time:109 secs, cpu time:108.95 units)
where 9 is the value of the outermost loop, and [8,9), [60,61), and [1,2) are the
semi open ranges
to search in the next three outer loops, with all other inner loops searched exhaustively.
From that short run, we can estimate the total required running time as:
109 * 125**4 = 26,611,328,125 seconds = 26611328125 / 60 / 60 / 24 / 365 = 844 years
The 125 in the formula above is used because we search the values
1..9, 11, 12, 14..127, a total of 125 distinct values.
So, removing the innermost loop, via the first theorem of ambrus,
reduced the estimated running time from 62,860 years down to 844 years, a 75-fold improvement!
Yay!
Intel Intrinsics
Intel intrinsic instructions are C style functions that provide access to many Intel
instructions -- including SSE, AVX, and more -- without the need to write assembly code.
-- Intel Intrinsics Guide
If necessary, and as a last resort, I was prepared to go to assembler for the inner loop.
After all, that is why I restructured the code in the first place.
In my reading, I stumbled across Intel intrinsics, which seemed a more convenient way
to explore Intel-specific instructions without needing to write any assembler.
In particular, the repetitive blocks like:
m4 = (m3 ^ q4) * H_PRIME;
d4 = (d3 ^ q4) * H_PRIME;
c4 = (c3 ^ q4) * H_PRIME;
l4 = (l3 ^ q4) * H_PRIME;
x4 = (x3 ^ q4) * H_PRIME;
seemed ripe for vectorization.
So, to gain some experience with using these new (to me) compiler intrinsics, I adjusted
the above
inner2.c code to use Intel AVX instructions, as shown below:
#include <emmintrin.h>
#include <smmintrin.h>
#define H_PRIME 1000003
#define HASH_LEN 11
// my_m128_t mimics Intel __m128i type
typedef struct my_m128_t my_m128_t;
struct my_m128_t { short m128i_i16[8]; };
int inner(
  int startval, int endval,
  int m2, int d2, int c2, int l2, int x2, int v2, int i2,
  my_m128_t* ps)
{
  __m128i s2 = _mm_set_epi32(x2, l2, c2, d2);
  __m128i hp = _mm_set1_epi32(H_PRIME);
  __m128i s3, s4, s5, s6, s7;
  int m3, m4, m5, m6, m7, m9;
  int d8, d9;
  int c9;
  int l9;
  int x9;
  int q3, q4, q5, q6, q7, q8, q9;
  int dist;
  int iret = 0;
  for (q3 = startval; q3 < endval; ++q3) {
    if (q3 == 10 || q3 == 13) continue;
    m3 = (m2 ^ q3) * H_PRIME;
    s3 = _mm_mullo_epi32(_mm_xor_si128(s2, _mm_set1_epi32(q3)), hp);
    for (q4 = 1; q4 < 128; ++q4) {
      if (q4 == 10 || q4 == 13) continue;
      m4 = (m3 ^ q4) * H_PRIME;
      s4 = _mm_mullo_epi32(_mm_xor_si128(s3, _mm_set1_epi32(q4)), hp);
      for (q5 = 1; q5 < 128; ++q5) {
        if (q5 == 10 || q5 == 13) continue;
        m5 = (m4 ^ q5) * H_PRIME;
        s5 = _mm_mullo_epi32(_mm_xor_si128(s4, _mm_set1_epi32(q5)), hp);
        for (q6 = 1; q6 < 128; ++q6) {
          if (q6 == 10 || q6 == 13) continue;
          m6 = (m5 ^ q6) * H_PRIME;
          s6 = _mm_mullo_epi32(_mm_xor_si128(s5, _mm_set1_epi32(q6)), hp);
          for (q7 = 1; q7 < 128; ++q7) {
            if (q7 == 10 || q7 == 13) continue;
            m7 = (m6 ^ q7) * H_PRIME;
            s7 = _mm_mullo_epi32(_mm_xor_si128(s6, _mm_set1_epi32(q7)), hp);
            for (q8 = 1; q8 < 128; ++q8) {
              if (q8 == 10 || q8 == 13) continue;
              d8 = (s7.m128i_i32[0] ^ q8) * H_PRIME ^ HASH_LEN;
              d9 = d8 & ~127;
              d9 %= 1001; if (d9 < 0) d9 += 1001;
              if (d9 < 500 - 127) continue;
              if (d9 > 500 + 127) continue;
              dist = d9 < 500 ? 500 - d9 : d9 - 500;
              q9 = (d8 & 127) ^ dist;
              if (q9 == 0 || q9 == 10 || q9 == 13) continue;
              c9 = (s7.m128i_i32[1] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
              c9 %= 1001; if (c9 < 0) c9 += 1001;
              if (c9 != 100) continue;
              l9 = (s7.m128i_i32[2] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
              l9 %= 1001; if (l9 < 0) l9 += 1001;
              if (l9 != 50) continue;
              x9 = (s7.m128i_i32[3] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
              x9 %= 1001; if (x9 < 0) x9 += 1001;
              if (x9 != 10) continue;
              m9 = (m7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
              if (m9 == -1) m9 = -2;
              m9 %= 1001; if (m9 < 0) m9 += 1001;
              if (m9 != 1000) continue;
              ps[iret].m128i_i16[0] = q3;
              ps[iret].m128i_i16[1] = q4;
              ps[iret].m128i_i16[2] = q5;
              ps[iret].m128i_i16[3] = q6;
              ps[iret].m128i_i16[4] = q7;
              ps[iret].m128i_i16[5] = q8;
              ps[iret].m128i_i16[6] = q9;
              ++iret;
            }
          }
        }
      }
    }
  }
  return iret;
}
The compiler intrinsics applied above are _mm_set1_epi32, _mm_xor_si128 and _mm_mullo_epi32; they are supported by MSVC, gcc, and clang alike, though the direct element access via s7.m128i_i32[...] is MSVC-specific.
Examination of the generated assembler showed that
application of these intrinsics
replaced four separate xor and imul instructions
on 32-bit registers with a single vpxor and vpmulld
instruction operating simultaneously on four 32-bit values
in a 128-bit XMM register, aka "Single Instruction Multiple Data" (SIMD).
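If you want to convince yourself that the packed step really does compute the same four hashes as the scalar code, here is a minimal standalone test of my own (assuming a CPU and compiler with SSE4.1 support, e.g. gcc -O2 -msse4.1 simd_check.c; the file name and input values are made up):
/* simd_check.c - compare one scalar hash round against the vectorized version */
#include <stdio.h>
#include <smmintrin.h>   /* _mm_mullo_epi32 (SSE4.1) */
#define H_PRIME 1000003
int main(void)
{
  int d2 = 12345, c2 = -67890, l2 = 13579, x2 = -24680, q3 = 42;
  /* scalar version, exactly as in inner2.c */
  int d3 = (d2 ^ q3) * H_PRIME;
  int c3 = (c2 ^ q3) * H_PRIME;
  int l3 = (l2 ^ q3) * H_PRIME;
  int x3 = (x2 ^ q3) * H_PRIME;
  /* vector version: pack d,c,l,x into one XMM register, then xor and
     multiply all four 32-bit lanes at once */
  __m128i s2 = _mm_set_epi32(x2, l2, c2, d2);          /* lane 0 = d2, lane 3 = x2 */
  __m128i s3 = _mm_mullo_epi32(_mm_xor_si128(s2, _mm_set1_epi32(q3)),
                               _mm_set1_epi32(H_PRIME));
  int out[4];
  _mm_storeu_si128((__m128i*)out, s3);   /* portable way to read the lanes back */
  printf("scalar: %d %d %d %d\n", d3, c3, l3, x3);
  printf("vector: %d %d %d %d\n", out[0], out[1], out[2], out[3]);
  return 0;
}
Both lines should print the same four numbers.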
Curiously, this AVX enabled program ran about eight seconds slower than the original,
while the same AVX enabling of the program described in the next section
ran about eight seconds faster, presumably due to the
very different inner loops.
In any case, the above vectorization is unlikely to make a big difference because it is used only in the outer loops, not the crucial inner loop.
In the next episode, we will meet an Intel instruction (prefetch) that makes a bigger performance impact.
Memoization
Memoization is an optimization technique used primarily to speed up computer programs
by keeping the results of expensive function calls and returning the cached result when
the same inputs occur again
-- Memoization (wikipedia)
Being more computer programmer than mathematician, I eliminated the inner
loop via a different method to ambrus: Memoization.
After staring for hours at that damned inner loop, it occurred to me that there
are "only" 2^32 (about four billion) distinct 32-bit inputs for each Roman letter, and the answer for each one (a q9 value in the range 0..127) fits in a single byte!
This 4GB limit depends on Python's hash function
generating 32-bit values, not 64-bit ones
(update: see later 32 vs 64 bit Python hash function analysis).
Assuming 32-bit hash values, we can pre-calculate the answers in a separate program
(a sketch is shown below; the full generator is linked from the update note at the end),
stuffing them into five separate 4GB tables, one for each of M, D, C, L, and X.
Nowadays, many home PCs (including mine) have 32 GB of memory and a 64-bit OS,
so this 20 GB memory requirement is not really extravagant any more.
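The sketch below is my own reconstruction of the idea for the M table, not the original generator (the output file name bytevec_m.dat is made up, and it assumes a 64-bit build with at least 4GB of free memory). For every possible 32-bit index it stores the unique q9, if any, that makes the final M hash come out at 1000 modulo 1001, and zero otherwise; the D, C, L, and X tables are built the same way with targets 500, 100, 50, and 10:
/* gen_bytevec_m.c - sketch of building the 4GB bytevecM lookup table */
#include <stdio.h>
#include <stdlib.h>
#define H_PRIME  1000003
#define HASH_LEN 11
#define TARGET_M 1000
int main(void)
{
  unsigned char* t = malloc(1ULL << 32);       /* one byte per 32-bit index */
  unsigned int i = 0;
  FILE* f;
  if (t == NULL) { fprintf(stderr, "out of memory\n"); return 1; }
  do {
    int h = (int)i * H_PRIME ^ HASH_LEN;
    int q9, found = 0;
    for (q9 = 1; q9 < 128; ++q9) {             /* brute force for clarity; the
                                                  first theorem of ambrus could
                                                  compute q9 directly instead */
      int m9;
      if (q9 == 10 || q9 == 13) continue;
      m9 = h ^ q9;
      if (m9 == -1) m9 = -2;                   /* Python hash() quirk */
      m9 %= 1001; if (m9 < 0) m9 += 1001;
      if (m9 == TARGET_M) { found = q9; break; }
    }
    t[i] = (unsigned char)found;               /* 0 means "no hit" */
  } while (++i != 0);                          /* wraps after all 2^32 indices */
  f = fopen("bytevec_m.dat", "wb");
  if (f == NULL || fwrite(t, 1, 1ULL << 32, f) != (1ULL << 32)) {
    fprintf(stderr, "write failed\n");
    return 1;
  }
  fclose(f);
  free(t);
  return 0;
}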
To contrast the two approaches, here is "the first theorem of ambrus" inner loop:
d8 = (d7 ^ q8) * H_PRIME ^ HASH_LEN;
d9 = d8 & ~127;
d9 %= 1001; if (d9 < 0) d9 += 1001;
if (d9 < 500 - 127) continue;
if (d9 > 500 + 127) continue;
dist = d9 < 500 ? 500 - d9 : d9 - 500;
q9 = (d8 & 127) ^ dist;
if (q9 == 0 || q9 == 10 || q9 == 13) continue;
c9 = (c7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
c9 %= 1001; if (c9 < 0) c9 += 1001;
if (c9 != 100) continue;
l9 = (l7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
l9 %= 1001; if (l9 < 0) l9 += 1001;
if (l9 != 50) continue;
x9 = (x7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
x9 %= 1001; if (x9 < 0) x9 += 1001;
if (x9 != 10) continue;
m9 = (m7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
if (m9 == -1) m9 = -2;
m9 %= 1001; if (m9 < 0) m9 += 1001;
if (m9 != 1000) continue;
ps[iret].m128i_i16[0] = q3;
ps[iret].m128i_i16[1] = q4;
ps[iret].m128i_i16[2] = q5;
ps[iret].m128i_i16[3] = q6;
ps[iret].m128i_i16[4] = q7;
ps[iret].m128i_i16[5] = q8;
ps[iret].m128i_i16[6] = q9;
++iret;
while here is "the first memory hack of eyepopslikeamosquito" inner loop:
q9 = bytevecM[(unsigned int)(m7 ^ q8)];
if (q9 != 0
&& q9 == bytevecD[(unsigned int)(d7 ^ q8)]
&& q9 == bytevecC[(unsigned int)(c7 ^ q8)]
&& q9 == bytevecL[(unsigned int)(l7 ^ q8)]
&& q9 == bytevecX[(unsigned int)(x7 ^ q8)]) {
ps[iret].m128i_i16[0] = q3;
ps[iret].m128i_i16[1] = q4;
ps[iret].m128i_i16[2] = q5;
ps[iret].m128i_i16[3] = q6;
ps[iret].m128i_i16[4] = q7;
ps[iret].m128i_i16[5] = q8;
ps[iret].m128i_i16[6] = q9;
++iret;
}
where
bytevecM,
bytevecD,
bytevecC,
bytevecL, and
bytevecX are
byte arrays of 4GB in size each (with a value of zero used to indicate "no hit").
Now, comparing the two sets of inner loop code, the memory hack has to be a lot faster, right?
After all, it replaces quite a bit of calculation with just two table lookups (in the common case).
I remembered, from years ago, that ctype.h
is typically implemented via a 256-byte lookup table, presumably because it is fast.
Table lookups are fast.
Or so I thought.
Enough theory, let's try it:
# find3a 9 8 9 60 61 1 2
9: sv1=8 ev1=9 sv2=60 ev2=61 sv3=1 ev3=2:
9 8 60 1 94 6 51 38 78 100
(wall clock time:94 secs, cpu time:93.90 units)
Hmmm, 94 seconds is only a modest improvement over 109. What's going on?
To find out, I profiled the code with Intel VTune ... and was horrified to learn
that the processor was spending nearly all its time sitting around doing nothing,
just waiting for memory! Why?
Modern Memory Architecture
As CPU cores become both faster and more numerous, the limiting factor for most programs
is now, and will be for some time, memory access. ...
Unfortunately, neither the structure nor the cost of using the memory subsystem of a computer
or the caches on CPUs is well understood by most programmers.
-- Ulrich Drepper
People who are more than casually interested in computers should have
at least some idea of what the underlying hardware is like.
-- Donald Knuth
Like many computer programmers, I had a mediocre understanding
of how memory hardware works on a modern PC.
By the way, the best reference I have found so far on this topic is
What Every Programmer Should Know About Memory (pdf)
by Ulrich Drepper.
If you know of a better one, please let us know.
Despite considerable study, my understanding of modern memory architectures
has only improved from mediocre to poor.
You see, as CPU speeds have kept relentlessly increasing faster than memory speeds,
memory architectures have grown more and more complicated,
adding more and more cache levels and layers,
to the point of melting my brain.
The typical modern PC has at least:
- An instruction cache to speed up instruction fetch.
- A data cache to speed up data fetch and store. Nowadays, this is usually a three-level cache: L1, L2, and L3.
- A translation lookaside buffer (TLB) to speed up virtual-to-physical address translation. This cache is also typically multi-level.
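If you are curious what your own machine reports for some of these numbers, here is a quick sketch; it assumes Linux with glibc, where the _SC_LEVEL* names are non-standard extensions that may report 0 if the information is unavailable:
/* cacheinfo.c - print a few cache parameters (Linux/glibc only) */
#include <stdio.h>
#include <unistd.h>
int main(void)
{
  printf("L1 data cache line: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
  printf("L1 data cache size: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
  printf("L2 cache size:      %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
  printf("L3 cache size:      %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
  printf("page size:          %ld bytes\n", sysconf(_SC_PAGESIZE));
  return 0;
}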
Note that data is transferred between memory and cache in blocks of fixed size,
called cache lines, typically 64 bytes in size.
All loads and stores go through the cache: if the requested byte is not already present,
its cache line is read in from main memory, replacing an existing cache line.
Since modern CPUs can execute hundreds of instructions in the time taken
to fetch a single cache line from main memory, we need to organize
things so that the CPU has something to do while waiting for the
cache line to load.
Otherwise, we have a CPU stall.
And that is what VTune was telling me I had.
It further warned me of a high rate of TLB cache misses.
After this basic memory research, it dawned on me that I had a
classic locality of reference
problem.
Because each memory access into my (huge) 4GB lookup tables was essentially random,
most memory accesses missed the L1 cache, missed the L2 cache, missed the L3 cache,
then waited for the cache line to be loaded into all three caches from main memory,
while often incurring a TLB cache miss, just to rub salt into the wounds. Again and again and again. Aargh!
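To get a feel for just how punishing this is, here is a small micro-benchmark of my own (a sketch only; the 1GB table size, LCG constants, and any timings are illustrative and will vary by machine) that sweeps a large byte array sequentially and then jumps through it in a scattered pseudo-random order:
/* memtest.c - sequential vs scattered access into a 1GB byte array */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define N (1u << 30)   /* 1GB: big enough to defeat the CPU caches */
int main(void)
{
  unsigned char* t = malloc(N);
  unsigned int i, idx = 0;
  unsigned long long sum = 0;  /* printing the checksum stops the compiler
                                  from optimizing the loops away */
  clock_t t0, t1, t2;
  if (t == NULL) return 1;
  for (i = 0; i < N; ++i) t[i] = (unsigned char)i;
  t0 = clock();
  for (i = 0; i < N; ++i)        /* sequential: each cache line fully used, */
    sum += t[i];                 /* and the hardware prefetcher helps       */
  t1 = clock();
  for (i = 0; i < N; ++i) {      /* scattered: nearly every access is a     */
    idx = idx * 1664525u + 1013904223u;   /* cache (and often TLB) miss     */
    sum += t[idx & (N - 1)];
  }
  t2 = clock();
  printf("checksum %llu\n", sum);
  printf("sequential: %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
  printf("scattered:  %.2f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
  return 0;
}
Expect the scattered pass to be dramatically slower than the sequential one, even though both execute exactly the same number of loads.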
What to do?
Well, an obvious fix is to find an algorithm that accesses the 4GB lookup
tables sequentially.
Unable to find such an algorithm, I resorted to yet more hacks,
described in the next installment.
By the way, the mind-boggling cost of cache misses with modern CPUs
has finally overturned the conventional C++ wisdom that linked lists
should be preferred to vectors when performing many insertions and
deletions in the middle of the list.
The new wisdom is simply: just use vector. :-)
That is, vector's superior data locality trumps linked list
in just about all known scenarios.
Two Oxen or 1024 Chickens?
If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?
-- Seymour Cray
The nice thing about the "first theorem of ambrus" code is that it is just plain, simple C code,
with modest memory requirements, that can run anywhere. So it is very suitable for massive
parallelization via CUDA/
OpenCL/
GPGPU/
FPGA, for example.
That is, this approach is well suited to employing 1024 chickens.
With its huge memory requirements, use of Intel intrinsics, plus OS support, the approach I took
to solving this problem was essentially that of harnessing two strong oxen.
I took that approach only due to the steep learning curve of CUDA/OpenCL/GPGPU/FPGA, which I have
never used. I would like to find the time in the future to pursue the 1024 chickens approach
with a CUDA/OpenCL/GPGPU/FPGA implementation of the second theorem of ambrus.
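For anyone tempted to try that, here is a small sketch of my own (an illustration only, not code from any actual search program; the example hash values in main are contrived) of the core step the second theorem describes: any modulus p that sends m9 to 1000 and d9 to 500 must divide both m9 - 1000 and d9 - 500, and therefore their gcd t, so only the divisors of t between 1001 and 9999 need to be checked against the remaining letters:
/* ambrus2_sketch.c - candidate moduli via the second theorem of ambrus */
#include <stdio.h>
static long long gcd_ll(long long a, long long b)
{
  while (b != 0) { long long r = a % b; a = b; b = r; }
  return a < 0 ? -a : a;
}
static void candidate_moduli(long long m9, long long d9,
                             long long c9, long long l9, long long x9)
{
  /* 64-bit arithmetic avoids the underflow the theorem warns about */
  long long t = gcd_ll(m9 - 1000, d9 - 500);
  long long p;
  if (t == 0) {                       /* rare case: every modulus fits M and D */
    printf("t is 0: all moduli 1001..9999 would have to be tried\n");
    return;
  }
  for (p = 1001; p <= 9999; ++p) {    /* simple scan; the theorem suggests
                                         factoring t and enumerating divisors */
    if (t % p != 0) continue;         /* p must divide t */
    if ((c9 - 100) % p != 0) continue;   /* then check the remaining letters */
    if ((l9 - 50) % p != 0) continue;
    if ((x9 - 10) % p != 0) continue;
    printf("candidate modulus: %lld\n", p);
  }
}
int main(void)
{
  /* contrived example: these values are congruent to the Roman targets
     (1000, 500, 100, 50, 10) modulo 1234, so 1234 should be reported */
  candidate_moduli(4702, 6670, 8738, 2518, 11116);
  return 0;
}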
Updated: Minor improvements to wording. In the "Memoization" section, added link to reply below that contains a program to generate the lookup table.