To recap, in parts one and two we reduced the estimated
running time of a magic formula search program from 52,887,477 years down to just 728 years.
Though that was gratifying, I really didn't want to wait that long.
I was happy to wait a year or two, but no more.
So I somehow had to make the search program run a hundred times faster.
That's not a minor disadvantage, we're talking about things
being 50 to 100 times slower with a linked list.
Compactness matters. Vectors are more compact than lists.
And predictable usage patterns matter enormously.
With a vector, you have to shove a lot of elements over,
but caches are really really good at that.
When you traverse lists you keep doing random access ...
so you are actually random accessing your memory and you
are maximizing your cache misses, which is exactly the
opposite of what you want.
-- Bjarne Stroustrup: Why you should avoid Linked Lists (3:30-5:00) (from GoingNative 2012 Keynote)
At the end of Part 2, you may recall that we had become
stalled by a nasty "cache miss" performance problem,
similar to the one Stroustrup describes above.
By following Stroustrup's advice to
"stay compact, stay predictable" -- and without changing the essential algorithm --
I was able to eventually achieve the massive speed up required.
Sometimes you think a code change "should" improve performance yet it does not.
Each specific code "performance improvement" therefore has to be benchmarked separately,
accepted if it improves performance, backed out if not. Painstaking work.
One step at a time ... one insight leads to another ... just like playing code golf. :-)
Lookup Table vs Calculation
In Part 1 we noticed that, when playing golf,
there is often a battle between a lookup table and a magic formula.
Curiously, when searching for a code golf magic formula,
there is often a similar battle,
this time between a lookup table and a calculation:
for example, to eliminate the inner loop, ambrus used a calculation;
I used a lookup table.
When searching for our magic formula,
lookup table wins for the first Roman Numeral value,
it's a tie for the second value,
and calculation wins for all subsequent values.
Why?
Well, from Part 2, the code to calculate the first Roman Numeral value (D):
d8 = (d7 ^ q8) * H_PRIME ^ HASH_LEN;
d9 = d8 & ~127;
d9 %= 1001; if (d9 < 0) d9 += 1001;
if (d9 < 500 - 127) continue;
if (d9 > 500 + 127) continue;
dist = d9 < 500 ? 500 - d9 : d9 - 500;
q9 = (d8 & 127) ^ dist;
if (q9 == 0 || q9 == 10 || q9 == 13) continue;
is significantly more work than the code to calculate the second and subsequent values namely:
c9 = (c7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
c9 %= 1001; if (c9 < 0) c9 += 1001;
if (c9 != 100) continue;
Apart from being more work, the first value always needs to be calculated,
while the second value only need be calculated about 255 times per 1001
(i.e. when d9 above is in the range 373..627).
Since the second value matches only once per 1001
(i.e. when c9 above is precisely 100),
the third value only need be calculated with a probability of (255/1001) * (1/1001),
the fourth value (255/1001) * (1/1001) * (1/1001), ... and so on.
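To put rough numbers on this argument, here is a back-of-the-envelope sketch (my arithmetic only, not taken from the original benchmarks) of the per-iteration probability that each value must be computed:

```c
#include <assert.h>

/* Rough probabilities, per q8 iteration, that each Roman Numeral
   value must be computed -- following the argument above. */
double p_second(void) { return 255.0 / 1001.0; }              /* d9 in 373..627  */
double p_third(void)  { return p_second() * (1.0 / 1001.0); } /* ... and c9 == 100 */
double p_fourth(void) { return p_third()  * (1.0 / 1001.0); } /* ... and l9 == 50  */
```

So the third value is computed only about once every four thousand iterations, and the fourth about once every four million, which is why their 4GB lookup tables are not worth keeping.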
Testing showed that hitting the third and subsequent values (i.e. L, X, M) is
a sufficiently rare event that, with current commodity hardware, it is not worth
cluttering all the memory caches with their 4GB lookup tables.
Of course, whether lookup table is faster than calculation
depends on CPU versus memory speed.
And main memory speed can be improved by spending a lot of money.
At one stage, I was considering boosting memory performance by buying
expensive high speed memory and overclocking my PC.
By the way, a nice bonus from reducing five 4GB lookup tables
down to just one is that the program's total memory requirement
is now much lower, allowing us to run it in certain
massively parallel environments, such as
Xeon Phi for example
(which currently has a 6GB limit per core).
Here is the inner loop code revised to use just one 4GB lookup table:
#define H_PRIME 1000003
#define HASH_LEN 11
// my_m128_t mimics Intel __m128i type
typedef struct my_m128_t my_m128_t;
struct my_m128_t { short m128i_i16[8]; };
int inner(
    const unsigned char* bytevecM,
    int startval, int endval,
    int m2, int d2, int c2, int l2, int x2,
    my_m128_t* ps)
{
    int m3, m4, m5, m6, m7;
    int d3, d4, d5, d6, d7, d9;
    int c3, c4, c5, c6, c7, c9;
    int l3, l4, l5, l6, l7, l9;
    int x3, x4, x5, x6, x7, x9;
    int q3, q4, q5, q6, q7, q8, q9;
    int iret = 0;
    for (q3 = startval; q3 < endval; ++q3) {
        if (q3 == 10 || q3 == 13) continue;
        m3 = (m2 ^ q3) * H_PRIME;
        d3 = (d2 ^ q3) * H_PRIME;
        c3 = (c2 ^ q3) * H_PRIME;
        l3 = (l2 ^ q3) * H_PRIME;
        x3 = (x2 ^ q3) * H_PRIME;
        for (q4 = 1; q4 < 128; ++q4) {
            if (q4 == 10 || q4 == 13) continue;
            m4 = (m3 ^ q4) * H_PRIME;
            d4 = (d3 ^ q4) * H_PRIME;
            c4 = (c3 ^ q4) * H_PRIME;
            l4 = (l3 ^ q4) * H_PRIME;
            x4 = (x3 ^ q4) * H_PRIME;
            for (q5 = 1; q5 < 128; ++q5) {
                if (q5 == 10 || q5 == 13) continue;
                m5 = (m4 ^ q5) * H_PRIME;
                d5 = (d4 ^ q5) * H_PRIME;
                c5 = (c4 ^ q5) * H_PRIME;
                l5 = (l4 ^ q5) * H_PRIME;
                x5 = (x4 ^ q5) * H_PRIME;
                for (q6 = 1; q6 < 128; ++q6) {
                    if (q6 == 10 || q6 == 13) continue;
                    m6 = (m5 ^ q6) * H_PRIME;
                    d6 = (d5 ^ q6) * H_PRIME;
                    c6 = (c5 ^ q6) * H_PRIME;
                    l6 = (l5 ^ q6) * H_PRIME;
                    x6 = (x5 ^ q6) * H_PRIME;
                    for (q7 = 1; q7 < 128; ++q7) {
                        if (q7 == 10 || q7 == 13) continue;
                        m7 = (m6 ^ q7) * H_PRIME;
                        d7 = (d6 ^ q7) * H_PRIME;
                        c7 = (c6 ^ q7) * H_PRIME;
                        l7 = (l6 ^ q7) * H_PRIME;
                        x7 = (x6 ^ q7) * H_PRIME;
                        for (q8 = 1; q8 < 128; ++q8) {
                            if (q8 == 10 || q8 == 13) continue;
                            q9 = bytevecM[(unsigned int)(m7 ^ q8)];
                            if (q9 == 0) continue;
                            d9 = (d7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                            d9 %= 1001; if (d9 < 0) d9 += 1001;
                            if (d9 != 500) continue;
                            c9 = (c7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                            c9 %= 1001; if (c9 < 0) c9 += 1001;
                            if (c9 != 100) continue;
                            l9 = (l7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                            l9 %= 1001; if (l9 < 0) l9 += 1001;
                            if (l9 != 50) continue;
                            x9 = (x7 ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                            x9 %= 1001; if (x9 < 0) x9 += 1001;
                            if (x9 != 10) continue;
                            ps[iret].m128i_i16[0] = q3;
                            ps[iret].m128i_i16[1] = q4;
                            ps[iret].m128i_i16[2] = q5;
                            ps[iret].m128i_i16[3] = q6;
                            ps[iret].m128i_i16[4] = q7;
                            ps[iret].m128i_i16[5] = q8;
                            ps[iret].m128i_i16[6] = q9;
                            ++iret;
                        }
                    }
                }
            }
        }
    }
    return iret;
}
As you can see, we simply combined the "first theorem of ambrus" code and the
"first hack of eyepopslikeamosquito" code from
Part 2.
Doing this reduced the running time from 95 to 74 seconds = 74 * 125**4 / 60 / 60 / 24 / 365 = 572 years.
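As an aside, the two-step remainder fixup that appears after every hash step above is there because C's % operator can return a negative result for a negative dividend, while the search needs the mathematical remainder in 0..1000. A minimal sketch of the idiom:

```c
#include <assert.h>

/* Reduce a (possibly negative) hash value to the mathematical
   remainder in 0..1000, exactly as the search code does inline. */
static int mod1001(int v) {
    v %= 1001;               /* in C, the result takes the sign of the dividend */
    if (v < 0) v += 1001;    /* shift negative remainders into 0..1000 */
    return v;
}
```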
Prefetching
In emerging and future high-end processor systems, tolerating increasing cache miss latency and properly
managing memory bandwidth will be critical to achieving high performance. Prefetching, in both hardware
and software, is among our most important available techniques for doing so.
-- When Prefetching Works, When it Doesn't, and Why (pdf)
What more can be done to speed up our (newly single) 4GB lookup table?
Well, before we enter the inner loop, we already know the limited
range of values required from the bytevecM lookup table.
In fact, just as in ambrus's first theorem, there are only 128 values,
and they are all right next to each other in memory.
To reduce memory latency, can we somehow tell the computer to start
the tortuous load sequence of those 128 bytes from main memory before we enter the inner loop? Yes!
The MSVC _mm_prefetch
compiler intrinsic, which under the covers generates the x86 SSE prefetcht0 instruction,
does just that.
That is, adding the following code immediately before the q8 loop above:
_mm_prefetch(&bytevecM[(unsigned int)m7 & 0xffffff80], _MM_HINT_T0);
_mm_prefetch(&bytevecM[64+((unsigned int)m7 & 0xffffff80)], _MM_HINT_T0);
begins bringing into all CPU caches the two (64-byte) cache lines required by the q8 loop for the
bytevecM lookup table before the loop starts.
Note that 0xffffff80 above is just ~127 in disguise.
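Why do two cache lines always suffice? Because q8 stays below 128, m7 ^ q8 can differ from m7 only in its low seven bits, so every index the q8 loop touches lies inside one aligned 128-byte window -- precisely the two 64-byte lines being prefetched. A small check (the m7 values below are arbitrary, not from the search):

```c
#include <assert.h>

/* Base of the aligned 128-byte window holding every index
   m7 ^ q8 for q8 in 1..127. */
static unsigned int window_base(unsigned int m7) {
    return m7 & 0xffffff80u;    /* same as m7 & ~127u */
}

/* Verify that all 127 lookup indices fall inside that window. */
static int window_covers_all(unsigned int m7) {
    unsigned int base = window_base(m7);
    unsigned int q8;
    for (q8 = 1; q8 < 128; ++q8) {
        unsigned int idx = m7 ^ q8;   /* only the low 7 bits can change */
        if (idx < base || idx >= base + 128) return 0;
    }
    return 1;
}
```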
Adding this prefetch hack reduced the running time by 6 seconds to 68 seconds = 68 * 125**4 / 60 / 60 / 24 / 365 = 526 years. Update: this saving was later improved from 6 to 17 seconds.
TLB Performance
There are a couple of factors which influence TLB performance.
The first is the size of the pages.
Obviously, the larger a page is, the more instructions or data objects will fit into it.
So a larger page size reduces the overall number of address translations which are needed,
meaning that fewer entries in the TLB cache are needed.
Most architectures nowadays allow the use of multiple different page sizes;
some sizes can be used concurrently.
For instance, the x86/x86-64 processors have a normal page size of 4kB
but they can also use 4MB and 2MB pages respectively.
-- Ulrich Drepper
Larger page sizes mean that a TLB cache of the same size can keep track
of larger amounts of memory, which avoids costly TLB misses.
-- Page (wikipedia)
As indicated above, there is a simple way to significantly boost TLB performance:
just use large pages!
So, I simply increased the (Windows 8) Operating System page size via the
following Win32 code and switched from malloc to my_big_malloc below
when allocating memory for the 4GB lookup table.
Note that running this code requires Administrator privileges.
BOOL MySetLockPagesPrivilege(HANDLE hProcess, BOOL bEnable)
{
    struct {
        DWORD Count;
        LUID_AND_ATTRIBUTES Privilege[1];
    } Info;
    HANDLE Token;
    if ( !OpenProcessToken(hProcess, TOKEN_ADJUST_PRIVILEGES, &Token) ) {
        printf("Cannot open process token.\n");
        return FALSE;
    }
    Info.Count = 1;
    if (bEnable) {
        Info.Privilege[0].Attributes = SE_PRIVILEGE_ENABLED;
    }
    else {
        Info.Privilege[0].Attributes = 0;
    }
    if ( !LookupPrivilegeValue(NULL, SE_LOCK_MEMORY_NAME, &(Info.Privilege[0].Luid)) ) {
        printf("Cannot get privilege for %s.\n", SE_LOCK_MEMORY_NAME);
        return FALSE;
    }
    if ( !AdjustTokenPrivileges(Token, FALSE, (PTOKEN_PRIVILEGES)&Info, 0, NULL, NULL) ) {
        printf("Cannot adjust token privileges (%u)\n", GetLastError());
        return FALSE;
    }
    // ERROR_NOT_ALL_ASSIGNED = 1300 (0x514):
    // Not all privileges or groups referenced are assigned to the caller.
    DWORD dwErr = GetLastError();
    printf("AdjustTokenPrivileges dwErr=%u\n", dwErr);
    if (dwErr != ERROR_SUCCESS) {
        printf("Cannot enable the SE_LOCK_MEMORY_NAME privilege");
        return FALSE;
    }
    CloseHandle(Token);
    return TRUE;
}
void LargeMemInit()
{
    if ( !MySetLockPagesPrivilege(GetCurrentProcess(), TRUE) ) {
        fprintf(stderr, "MySetLockPagesPrivilege failed.\n");
        exit(1);
    }
}

// VirtualAlloc may fail with: ERROR_NO_SYSTEM_RESOURCES = 1450. To fix, reboot.
static void* my_big_malloc(size_t s)
{
    DWORD dwErr;
    void* pMem;
    pMem = VirtualAlloc(
        NULL,    // let the system decide where to allocate it
        s,       // size of memory block
        MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
        PAGE_READWRITE
    );
    if (pMem == NULL) {
        // Error 1314 = ERROR_PRIVILEGE_NOT_HELD
        dwErr = GetLastError();
        fprintf(stderr, "VirtualAlloc failed: dwError=%d\n", (int)dwErr);
        exit(1);
    }
    printf("VirtualAlloc ok p=%p\n", pMem);
    return pMem;
}
Note that SeLockMemoryPrivilege ("Lock pages in memory") is required to use large pages.
Also, obtaining 4GB of contiguous space may fail after the system has been
running for a while due to memory fragmentation
(to work around this, I typically fired up my search program right after a reboot).
For Oracle Databases, using HugePages reduces the operating system maintenance of page states,
and increases Translation Lookaside Buffer (TLB) hit ratio.
-- Oracle Database Administrator's Reference (Unix and Linux)
You can similarly increase the page size on Linux, as indicated in the Oracle documentation above.
In my googling on this topic, I discovered that all the big database vendors have an option
to use Large/Huge pages, indicating that this is an important performance optimization,
especially when addressing large amounts of memory.
oiskuu kindly messaged me this interesting research paper (pdf)
describing how to hack direct memory segments into the kernel.
So further TLB performance improvements are certainly possible.
Memory alignment can also impact performance, so when in doubt, I aligned
memory on a 64 byte boundary.
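For the alignment part, C11's aligned_alloc (or posix_memalign) makes the 64-byte cache-line boundary explicit; a minimal sketch (the helper name is mine):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate n bytes on a 64-byte (cache line) boundary.
   aligned_alloc requires the size to be a multiple of the alignment,
   so round n up to the next multiple of 64 first. */
static void* cache_aligned_malloc(size_t n) {
    size_t rounded = (n + 63) & ~(size_t)63;
    return aligned_alloc(64, rounded);
}
```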
The straightforward large page change above reduced the running time by a further 10 seconds, down to
58 seconds = 58 * 125**4 / 60 / 60 / 24 / 365 = 449 years.
Vectorization
Intel Advanced Vector Extensions (Intel AVX) is a set of instructions for
doing Single Instruction Multiple Data (SIMD) operations on Intel architecture CPUs
-- Introduction to Intel Advanced Vector Extensions
As a further optimization, I applied some MSVC AVX compiler intrinsics,
namely _mm_set1_epi32, _mm_xor_si128 and _mm_mullo_epi32,
as shown below:
#include <emmintrin.h>
#include <smmintrin.h>
#define H_PRIME 1000003
#define HASH_LEN 11
// my_m128_t mimics Intel __m128i type
typedef struct my_m128_t my_m128_t;
struct my_m128_t { short m128i_i16[8]; };
int inner(
    const unsigned char* bytevecM,
    int startval, int endval,
    int m2, int d2, int c2, int l2, int x2,
    my_m128_t* ps)
{
    __m128i s2 = _mm_set_epi32(x2, l2, c2, d2);
    __m128i hp = _mm_set1_epi32(H_PRIME);
    __m128i s3, s4, s5, s6, s7;
    int m3, m4, m5, m6, m7;
    int d9, c9, l9, x9;
    int q3, q4, q5, q6, q7, q8, q9;
    int iret = 0;
    for (q3 = startval; q3 < endval; ++q3) {
        if (q3 == 10 || q3 == 13) continue;
        m3 = (m2 ^ q3) * H_PRIME;
        s3 = _mm_mullo_epi32(_mm_xor_si128(s2, _mm_set1_epi32(q3)), hp);
        for (q4 = 1; q4 < 128; ++q4) {
            if (q4 == 10 || q4 == 13) continue;
            m4 = (m3 ^ q4) * H_PRIME;
            s4 = _mm_mullo_epi32(_mm_xor_si128(s3, _mm_set1_epi32(q4)), hp);
            for (q5 = 1; q5 < 128; ++q5) {
                if (q5 == 10 || q5 == 13) continue;
                m5 = (m4 ^ q5) * H_PRIME;
                s5 = _mm_mullo_epi32(_mm_xor_si128(s4, _mm_set1_epi32(q5)), hp);
                for (q6 = 1; q6 < 128; ++q6) {
                    if (q6 == 10 || q6 == 13) continue;
                    m6 = (m5 ^ q6) * H_PRIME;
                    s6 = _mm_mullo_epi32(_mm_xor_si128(s5, _mm_set1_epi32(q6)), hp);
                    for (q7 = 1; q7 < 128; ++q7) {
                        if (q7 == 10 || q7 == 13) continue;
                        m7 = (m6 ^ q7) * H_PRIME;
                        s7 = _mm_mullo_epi32(_mm_xor_si128(s6, _mm_set1_epi32(q7)), hp);
                        _mm_prefetch(&bytevecM[(unsigned int)m7 & 0xffffff80], _MM_HINT_T0);
                        _mm_prefetch(&bytevecM[64+((unsigned int)m7 & 0xffffff80)], _MM_HINT_T0);
                        for (q8 = 1; q8 < 128; ++q8) {
                            if (q8 == 10 || q8 == 13) continue;
                            q9 = bytevecM[(unsigned int)(m7 ^ q8)];
                            if (q9 == 0) continue;
                            d9 = (s7.m128i_i32[0] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                            d9 %= 1001; if (d9 < 0) d9 += 1001;
                            if (d9 != 500) continue;
                            c9 = (s7.m128i_i32[1] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                            c9 %= 1001; if (c9 < 0) c9 += 1001;
                            if (c9 != 100) continue;
                            l9 = (s7.m128i_i32[2] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                            l9 %= 1001; if (l9 < 0) l9 += 1001;
                            if (l9 != 50) continue;
                            x9 = (s7.m128i_i32[3] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
                            x9 %= 1001; if (x9 < 0) x9 += 1001;
                            if (x9 != 10) continue;
                            ps[iret].m128i_i16[0] = q3;
                            ps[iret].m128i_i16[1] = q4;
                            ps[iret].m128i_i16[2] = q5;
                            ps[iret].m128i_i16[3] = q6;
                            ps[iret].m128i_i16[4] = q7;
                            ps[iret].m128i_i16[5] = q8;
                            ps[iret].m128i_i16[6] = q9;
                            ++iret;
                        }
                    }
                }
            }
        }
    }
    return iret;
}
Examination of the generated assembler shows that these compiler intrinsics
replaced four separate xor and imul instructions on 32-bit registers
with one vpxor and one vpmulld instruction operating
simultaneously on four 32-bit values in a 128-bit XMM register.
This data level parallelism gained a further three seconds, reducing
the estimated running time to 425 years.
Loop unwinding
Loop unwinding, also known as loop unrolling, is a loop transformation
technique that attempts to optimize a program's execution speed at the
expense of its binary size (space-time tradeoff).
The goal of loop unwinding is to increase a program's speed by reducing (or eliminating)
instructions that control the loop, such as pointer arithmetic and "end of loop" tests
on each iteration; reducing branch penalties; as well as "hiding latencies,
in particular, the delay in reading data from memory".
To eliminate this overhead, loops can be re-written
as a repeated sequence of similar independent statements.
-- Loop unwinding (wikipedia)
As a further optimization, I performed the tedious yet important "inner loop unwinding optimization"
by defining:
#define UNROLL(qx) \
    q9 = bytevecM[(unsigned int)(m7 ^ qx)]; \
    if (q9 != 0) { \
        d9 = (s7.m128i_i32[0] ^ qx) * H_PRIME ^ HASH_LEN ^ q9; \
        d9 %= 1001; if (d9 < 0) d9 += 1001; \
        if (d9 == 500) { \
            c9 = (s7.m128i_i32[1] ^ qx) * H_PRIME ^ HASH_LEN ^ q9; \
            c9 %= 1001; if (c9 < 0) c9 += 1001; \
            if (c9 == 100) { \
                l9 = (s7.m128i_i32[2] ^ qx) * H_PRIME ^ HASH_LEN ^ q9; \
                l9 %= 1001; if (l9 < 0) l9 += 1001; \
                if (l9 == 50) { \
                    x9 = (s7.m128i_i32[3] ^ qx) * H_PRIME ^ HASH_LEN ^ q9; \
                    x9 %= 1001; if (x9 < 0) x9 += 1001; \
                    if (x9 == 10) { \
                        ps[iret].m128i_i16[0] = q3; \
                        ps[iret].m128i_i16[1] = q4; \
                        ps[iret].m128i_i16[2] = q5; \
                        ps[iret].m128i_i16[3] = q6; \
                        ps[iret].m128i_i16[4] = q7; \
                        ps[iret].m128i_i16[5] = qx; \
                        ps[iret].m128i_i16[6] = q9; \
                        ++iret; \
                    } \
                } \
            } \
        } \
    }
then adjusting the inner loop from:
for (q8 = 1; q8 < 128; ++q8) {
    if (q8 == 10 || q8 == 13) continue;
    q9 = bytevecM[(unsigned int)(m7 ^ q8)];
    if (q9 == 0) continue;
    d9 = (s7.m128i_i32[0] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
    d9 %= 1001; if (d9 < 0) d9 += 1001;
    if (d9 != 500) continue;
    c9 = (s7.m128i_i32[1] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
    c9 %= 1001; if (c9 < 0) c9 += 1001;
    if (c9 != 100) continue;
    l9 = (s7.m128i_i32[2] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
    l9 %= 1001; if (l9 < 0) l9 += 1001;
    if (l9 != 50) continue;
    x9 = (s7.m128i_i32[3] ^ q8) * H_PRIME ^ HASH_LEN ^ q9;
    x9 %= 1001; if (x9 < 0) x9 += 1001;
    if (x9 != 10) continue;
    ps[iret].m128i_i16[0] = q3;
    ps[iret].m128i_i16[1] = q4;
    ps[iret].m128i_i16[2] = q5;
    ps[iret].m128i_i16[3] = q6;
    ps[iret].m128i_i16[4] = q7;
    ps[iret].m128i_i16[5] = q8;
    ps[iret].m128i_i16[6] = q9;
    ++iret;
}
to first, the conservative:
UNROLL(1)
UNROLL(2)
UNROLL(3)
UNROLL(4)
UNROLL(5)
UNROLL(6)
UNROLL(7)
UNROLL(8)
UNROLL(9)
UNROLL(11)
UNROLL(12)
for (q8 = 14; q8 < 128; ++q8) {
UNROLL(q8)
}
and then the more radical:
UNROLL(1)
UNROLL(2)
UNROLL(3)
...
UNROLL(127)
You cannot know in advance which of the three variations will be
fastest; each has to be benchmarked.
The more you unroll, the bigger the code, which puts more pressure
on the instruction cache; on the other hand, unrolling eliminates
the repeated loop-control tests, making each iteration cheaper.
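As a toy illustration of the technique (deliberately much simpler than the search code), here is a summation loop unrolled by four, with a remainder loop for the leftover elements:

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array with the loop body unrolled by 4: three out of every
   four "end of loop" tests are eliminated, at the cost of bigger code. */
static long sum_unrolled4(const int* a, size_t n) {
    long total = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {   /* unrolled body */
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }
    for (; i < n; ++i)             /* leftover elements */
        total += a[i];
    return total;
}
```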
The conservative unwinding above gained seven seconds,
while the radical one actually made the program slower.
To summarize then, our estimated running time was reduced in this node from 728 years down to 371 years (update: 286 years now).
Though making such an improvement was certainly pleasurable, that is still nowhere near fast enough.
How to find that elusive order of magnitude speed improvement?
Find out in the next (and final) episode.
Update 12-May: Minor improvements to wording. Minor changes to code: removed unused bytevecD, bytevecC, bytevecL, bytevecX arguments
to inner() function; added (unsigned int) cast to m7 in _mm_prefetch() call. Update 16-May: improved final running time from 371 to 286 years.