When I looked for magic formulae, etc., I would give up after 20 minutes!
-- Jasper
Unlike Jasper, I have spent more than 20 minutes searching for magic formulae.
Especially ancient Roman ones.
I first became aware of them back in 2006 during
the Fonality Christmas Golf Challenge
where I was astonished by Ton's ingenuity and deviousness in constructing
his original HART (Hospelian Arabic to Roman Transform) magic formula.
Shortly after the Fonality game, codegolf.com hosted an endless competition
where you must convert the other way, from Roman to Decimal.
By 2009, I had managed to take the lead in all four languages
(Perl, Python, Ruby, PHP), as detailed
in this series of nodes.
And the winning solutions -- in all four languages --
all used (quite different) magic formulae!
To refresh your memory, the codegolf game rules were essentially:
Convert a Roman numeral to its integer value.
The input numeral, provided on stdin followed by a single newline, is
guaranteed to be in the range I to MMMCMXCIX (1 to 3999).
For example, given IV on stdin, your program should print 4 to stdout.
As you might expect, a key component of solutions to this game
is mapping individual Roman letters to their decimal equivalents.
To help us focus on that, let's define a simpler spec:
Write a function to convert a single Roman Numeral letter to its decimal equivalent.
The function assumes the Roman letter is in $_
and returns its decimal equivalent.
To clarify, here's a sample (non-golfed) Perl solution:
sub r {
my %h = ( I=>1, V=>5, X=>10, L=>50, C=>100, D=>500, M=>1000 );
return $h{$_};
}
print "$_: ", 0+r(), "\n" for (qw(I V X L C D M));
Running this program produces:
I: 1
V: 5
X: 10
L: 50
C: 100
D: 500
M: 1000
Lookup Table vs Magic Formula
In code golf, there is often a battle between a lookup table
and a magic formula. So it proved here.
When converting from Roman to Decimal, magic formula trumps lookup table.
Here are three attempts to solve this problem using a lookup table:
sub r {%h=(I,1,V,5,X,10,L,50,C,100,D,500,M,1000);$h{$_}}
sub r {M1000D500C100L50X10V5I1=~$_;$'}
sub r {M999D499C99L49X9V4I=~$_+$'}
And here are some of my earliest magic formulae from 2007:
sub r {5**y/VLD//.E.(3*/M/+2*/C|D/+/X|L/)}
sub r {1 .E.~-ord()*41%52%5>>y/VLD//}
sub r {5**y/VLD//.E.ord()x3%75%50%4}
sub r {1 .E.(72^ord)*5/7%5>>y/VLD//}
sub r {5**y/VLD//.E.(42^88*ord)%5}
sub r {1 .E.(3^77%ord)%7>>y/VLD//}
Back then, my magic formula searching skills were so poor that I
erroneously concluded that the lookup table approach was better. :-)
Later, when I tackled this problem in Python, I really needed to use
each Roman letter once only in the formula, which forced me to explore
alternative approaches ... which, in turn, led to still shorter Perl magic formulae,
such as:
sub r {10**(7&69303333/ord)%9995}
sub r {10**(7&5045e8/ord)%2857} # needs 64-bit Perl
sub r {IXCMVLD=~$_;"1E@-"%9995}
sub r {XCMVLD=~$_;"1E@+"%9995} # Grimy improvement
Back in the good old days, the little search programs that uncovered these
solutions took hours to run, not years. :-)
The long numbers (such as 69303333 above) that began popping up in these formulae
were an indication that the ord function didn't scale very well as
the required solutions became less probable.
Can we find a built-in function better suited to Roman magic formulae?
In PHP and Python, yes. In Perl and Ruby, probably not.
At least, I don't know of one.
The PHP built-in md5 function is perfect for Roman magic formulae.
Not only does it generate all required digits (0-9), it generates little else (just a-f).
Moreover, you can use all 256 characters in a magic formula,
compared to just ten (0-9) for ord.
To illustrate, in an eight character magic string (for example 69303333),
there are just 10**8 combinations available for ord, while there
are a mind-boggling 256**8=1.8*10**19 combinations available for md5!
This is a huge advantage when searching for highly improbable solutions,
as we shall see later.
The Python hash function is also superior to ord,
though not as good as PHP's md5. This is because Python's Unicode
and character escaping claptrap limits you to 125 characters
(namely ord 1..9,11,12,14..127) that can be used as input to the hash
function without special treatment.
Still, 125 is a huge improvement over 10!
One drawback of hash compared to md5 is that it generates
huge numbers, forcing you to waste five strokes with %1001
to trim the generated numbers into the required Roman Numeral range (1-1000).
In some cases, the Perl crypt built-in could be employed, though
it is not well-suited to this specific problem because (unlike md5)
it generates many other characters in addition to the desired 0-9.
To recap, the shortest magic formulae found so far in all four languages
(as detailed in this series of nodes) are:
10**(205558%ord(r)%7)%9995 # Python
hash(r+"magicstrng")%1001 # Python (finding magicstrng is the subjec
+t of this node!)
VLD=~$_*5+IXCM=~$_."E@-" # Perl
10**(7&5045e8/ord)%2857 # Perl (64-bit)
IXCMVLD=~$_;"1E@-"%9995 # Perl
XCMVLD=~$_;"1E@+"%9995 # Perl (update: Grimy improvement)
uppp&md5_hex$_.PQcUv # Perl (needs Digest::MD5 module)
10**(494254%r/9)%4999 # Ruby (no need for explicit ord in this g
+ame)
md5($r.magicstrng) # PHP (finding magicstrng is an unsolved p
+roblem)
md5($r.PQcUv)&uppp # PHP wins due to superb mf properties of
+md5
The 10**21 Problem
There's just no sane regularity in this thing. But in "random" mappings
with a very small result set like this, the shortest solution is often
to make up some magic formula that has no particular meaning, but just
happens to give the wanted result.
-- Ton Hospel explaining his original decimal-to-roman magic formula
When Ton Hospel invented magic formulae in golf way back in 2004,
he correctly noted that they work best "with a very small result set".
Indeed, if there were only five Roman Numeral letters, rather than seven, we would
have a straightforward 10**15 problem, rather than a (borderline intractable)
10**21 problem. Why 10**21?
Suppose you have a magic formula that produces a random number between 1 and
1000. The probability of scoring a (lucky) hit for one numeral is therefore
1 in 1000. Since probabilities multiply, the probability of hitting five
out of five numerals is 1 in 10**15, while the chance of hitting
all seven Roman Numerals is a daunting 1 in 10**21.
Though 1 in 10**21 sounds improbable in the extreme,
if you could generate 10**21 combinations, you would not need luck,
indeed you would expect to find such an improbable solution.
Yet how long would it take to search 10**21
combinations for the elusive (lucky) one?
Well, if you could perform 10,000 search operations per second, the time
required to search 10**21 combinations is a
mere 10**21/10000 seconds = 10**17 seconds = 3,170,979,198 years!
By the way, this "brute force search infeasibility problem" is why you
are asked to create longer and longer passwords, and with a wider range
of characters in them, as computer speeds improve.
Indeed, Password cracking
and magic formula searching are closely related disciplines,
the lack of a "codegolf magic formula" wikipedia page notwithstanding. :-)
To make this theoretical argument more concrete, consider searching
for a Roman to Decimal magic formula using the Python built-in
hash function.
Since this hash function produces very large values, we need to apply %1001
to it so as to produce an essentially random value in the desired 1..1000 range.
We might try searching for such a magic formula using a simple Python brute-force
search program, such as:
import time
print time.time(), time.clock()
for q0 in range(1, 128):
for q1 in range(1, 128):
for q2 in range(1, 128):
for q3 in range(1, 128):
for q4 in range(1, 128):
for q5 in range(1, 128):
for q6 in range(1, 128):
for q7 in range(1, 128):
for q8 in range(1, 128):
for q9 in range(1, 128):
magic = chr(q0)+chr(q1)+chr(q2)+chr(q3)+chr(q4)+chr(q5)+chr(
+q6)+chr(q7)+chr(q8)+chr(q9)
m = hash("M" + magic) % 1001
if m != 1000: continue
d = hash("D" + magic) % 1001
if d != 500: continue
c = hash("C" + magic) % 1001
if c != 100: continue
l = hash("L" + magic) % 1001
if l != 50: continue
x = hash("X" + magic) % 1001
if x != 10: continue
v = hash("V" + magic) % 1001
if v != 5: continue
i = hash("I" + magic) % 1001
if i != 1: continue
print "bingo!", q0, q1, q2, q3, q4, q5, q6, q7, q8, q9
print time.time(), time.clock()
On my machine, this naive brute force search for a ten-character magic
string is expected to find a solution in about 52,887,477 years!
Note that, with 127 different values for each character in the string,
we need a 10-character magic string to give us the required
127**10 = 1.1e21 = 10**21 combinations.
Rather than give up at this point, I found this
"impossible" 50 million year challenge
strangely irresistible ... and was eager to learn about any high performance computing techniques required to solve it.
I get it -- optimization is a fun game ... one can play all day with unrolling loops, peeling away layers of indirection, and so forth to gain cycles, while piddling away time and energy
-- davido
Or, as davido puts it, "optimization is a fun game".
Well, I enjoy it.
So I started by rewriting the above search program in C++, then kept on refining
it until I was able to reduce the running time from fifty million years down to
only several years and eventually find a solution.
This new series of articles describes that endeavour.
Please note that this series focuses more on High Performance Computing/Supercomputing/Parallel computing
in C/C++/Intel assembler than code golf.
First Attempt
Re-writing our crude Python searcher in C/C++ we get:
#include <stdio.h>
#include <time.h>
// Python hash() function. Assumes 32-bit int.
// Derived from string_hash() in stringobject.c in Python 2.5.1 source
+s.
static int py_hash(const char* a, int size)
{
int len = size;
const unsigned char* p = (unsigned char*)a;
int x = *p << 7;
while (--len >= 0) x = (1000003 * x) ^ *p++;
x ^= size;
if (x == -1) x = -2;
return x;
}
// Simulate python modulo operator. Assumes mod is positive.
static int py_mod(int j, int mod)
{
int ret;
if (j >= 0) {
ret = j % mod;
} else {
ret = j - (j / mod) * mod;
if (ret) ret += mod;
}
return ret;
}
int main()
{
int q0, q1, q2, q3, q4, q5, q6, q7, q8, q9;
int m, d, c, l, x, v, i;
char magic[16];
time_t tstart = time(NULL);
clock_t cstart = clock();
time_t tend;
clock_t cend;
if (sizeof(int) != 4) { fprintf(stderr, "oops sizeof int != 4\n");
+return 1; }
for (q0 = 1; q0 < 128; ++q0) {
magic[1] = q0;
for (q1 = 1; q1 < 128; ++q1) {
magic[2] = q1;
for (q2 = 1; q2 < 128; ++q2) {
magic[3] = q2;
for (q3 = 1; q3 < 128; ++q3) {
magic[4] = q3;
for (q4 = 1; q4 < 128; ++q4) {
magic[5] = q4;
for (q5 = 1; q5 < 128; ++q5) {
magic[6] = q5;
for (q6 = 1; q6 < 128; ++q6) {
magic[7] = q6;
for (q7 = 1; q7 < 128; ++q7) {
magic[8] = q7;
for (q8 = 1; q8 < 128; ++q8) {
magic[9] = q8;
for (q9 = 1; q9 < 128; ++q9) {
magic[10] = q9;
magic[0] = 'M'; m = py_mod(py_hash(magic, 11), 1001);
if (m != 1000) continue;
magic[0] = 'D'; d = py_mod(py_hash(magic, 11), 1001);
if (d != 500) continue;
magic[0] = 'C'; c = py_mod(py_hash(magic, 11), 1001);
if (c != 100) continue;
magic[0] = 'L'; l = py_mod(py_hash(magic, 11), 1001);
if (l != 50) continue;
magic[0] = 'X'; x = py_mod(py_hash(magic, 11), 1001);
if (x != 10) continue;
magic[0] = 'V'; v = py_mod(py_hash(magic, 11), 1001);
if (v != 5) continue;
magic[0] = 'I'; i = py_mod(py_hash(magic, 11), 1001);
if (i != 1) continue;
printf("bingo! %d %d %d %d %d %d %d %d %d %d\n",
q0, q1, q2, q3, q4, q5, q6, q7, q8, q9);
}
}
}
}
}
}
}
}
}
}
tend = time(NULL);
cend = clock();
printf("(wall clock time:%ld secs, cpu time:%.2f units)\n",
(long) (difftime(tend, tstart)+0.5),
(double) (cend-cstart) / (double)CLOCKS_PER_SEC);
return 0;
}
which ran about 190 times faster than the Python version.
Down to a running time of 275,530 years now. Woohoo!
Unroll the hash and reorg the code
An obvious optimization is to "unroll" the py_hash() function within the loop;
that is, incrementally build the hash value within the loop and remove the
py_hash() function.
Note, by the way, that this algorithm is very well-suited to multi-threading;
each thread can be given a different starting and ending range to search
and just go at it; there is no need for any thread synchronization at all.
So I took the opportunity at this stage to reorganize the code,
to factor out the "inner loop" into its own file with a single function (with
a C call interface) that just contains raw calculation, no global or static variables,
no library calls, no I/O. Just code.
This reorganization has a number of benefits:
- Easy to make the inner loop code thread-safe.
- Easy to call from multiple threads (having the "inner loop" function do no I/O simplifies this).
- Gives you the option to rewrite the "inner loop" file (in assembler, say) so long as you honor its interface.
- Easier to port. The code that launches threads will likely be different on each platform (e.g. POSIX vs Win32 threads), the I/O code may be different on each platform, while the inner loop code can be the same on all platforms (or tailored to suit).
- From now on, I can focus on the crucial inner loop code, and so have less boilerplate C++ code cluttering this meditation. :-)
Of course, I didn't want function call overhead to inhibit this reorganization,
so I arbitrarily made the new inner loop function perform the innermost seven loops.
After the code reorganization and hash function unrolling, here is the new main program, find1.cpp:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#define H_PRIME 1000003
// Most unlikely to get this many hits in a single call of the inner l
+oop function.
#define MAX_HITS 64
// my_m128_t mimics Intel __m128i type (which will be used later)
typedef struct my_m128_t my_m128_t;
struct my_m128_t { short m128i_i16[8]; };
// the inner loop function (lives in a separate file)
extern "C" int inner(
int startval, int endval,
int m2, int d2, int c2, int l2, int x2, int v2, int i2,
my_m128_t* ps
);
void do_one(int q0, int sv1, int ev1, int sv2, int ev2, int sv3, int e
+v3)
{
time_t tstart = time(NULL);
clock_t cstart = clock();
time_t tend;
clock_t cend;
my_m128_t soln[MAX_HITS];
int m1, m2;
int m_char = 'M';
int m0 = ( ( H_PRIME * (m_char << 7) ) ^ m_char ) * H_PRIME;
int d1, d2;
int d_char = 'D';
int d0 = ( ( H_PRIME * (d_char << 7) ) ^ d_char ) * H_PRIME;
int c1, c2;
int c_char = 'C';
int c0 = ( ( H_PRIME * (c_char << 7) ) ^ c_char ) * H_PRIME;
int l1, l2;
int l_char = 'L';
int l0 = ( ( H_PRIME * (l_char << 7) ) ^ l_char ) * H_PRIME;
int x1, x2;
int x_char = 'X';
int x0 = ( ( H_PRIME * (x_char << 7) ) ^ x_char ) * H_PRIME;
int v1, v2;
int v_char = 'V';
int v0 = ( ( H_PRIME * (v_char << 7) ) ^ v_char ) * H_PRIME;
int i1, i2;
int i_char = 'I';
int i0 = ( ( H_PRIME * (i_char << 7) ) ^ i_char ) * H_PRIME;
int q1, q2;
int mm0 = (m0 ^ q0) * H_PRIME;
int dd0 = (d0 ^ q0) * H_PRIME;
int cc0 = (c0 ^ q0) * H_PRIME;
int ll0 = (l0 ^ q0) * H_PRIME;
int xx0 = (x0 ^ q0) * H_PRIME;
int vv0 = (v0 ^ q0) * H_PRIME;
int ii0 = (i0 ^ q0) * H_PRIME;
fprintf(stderr, "%d: sv1=%d ev1=%d sv2=%d ev2=%d sv3=%d ev3=%d:\n",
q0, sv1, ev1, sv2, ev2, sv3, ev3);
// We ignore 0, 10, 13 because these three require escaping in Pytho
+n.
for (q1 = sv1; q1 < ev1; ++q1) {
if (q1 == 10 || q1 == 13) continue;
m1 = (mm0 ^ q1) * H_PRIME;
d1 = (dd0 ^ q1) * H_PRIME;
c1 = (cc0 ^ q1) * H_PRIME;
l1 = (ll0 ^ q1) * H_PRIME;
x1 = (xx0 ^ q1) * H_PRIME;
v1 = (vv0 ^ q1) * H_PRIME;
i1 = (ii0 ^ q1) * H_PRIME;
for (q2 = sv2; q2 < ev2; ++q2) {
if (q2 == 10 || q2 == 13) continue;
m2 = (m1 ^ q2) * H_PRIME;
d2 = (d1 ^ q2) * H_PRIME;
c2 = (c1 ^ q2) * H_PRIME;
l2 = (l1 ^ q2) * H_PRIME;
x2 = (x1 ^ q2) * H_PRIME;
v2 = (v1 ^ q2) * H_PRIME;
i2 = (i1 ^ q2) * H_PRIME;
int isoln = inner(sv3, ev3, m2, d2, c2, l2, x2, v2, i2, soln);
if (isoln > 0) {
int k;
for (k = 0; k < isoln; ++k) {
fprintf(stderr, "%d %d %d %hd %hd %hd %hd %hd %hd %hd\n",
q0, q1, q2, soln[k].m128i_i16[0],
soln[k].m128i_i16[1], soln[k].m128i_i16[2], soln[k].m128i_i1
+6[3],
soln[k].m128i_i16[4], soln[k].m128i_i16[5], soln[k].m128i_i1
+6[6]);
}
}
}
}
tend = time(NULL);
cend = clock();
fprintf(stderr, "(wall clock time:%ld secs, cpu time:%.2f units)\n"
+,
(long) (difftime(tend, tstart)+0.5),
(double) (cend-cstart) / (double)CLOCKS_PER_SEC);
}
int main(int argc, char* argv[])
{
int sv0, sv1, sv2, sv3, ev1, ev2, ev3;
if (argc != 8) { fprintf(stderr, "usage: prog sv0 sv1 ev1 sv2 ev2 s
+v3 ev3\n"); return 1; }
sv0 = atoi(argv[1]);
sv1 = atoi(argv[2]);
ev1 = atoi(argv[3]);
sv2 = atoi(argv[4]);
ev2 = atoi(argv[5]);
sv3 = atoi(argv[6]);
ev3 = atoi(argv[7]);
do_one(sv0, sv1, ev1, sv2, ev2, sv3, ev3);
return 0;
}
And here is the "inner loop" function, in its own file, inner1.c:
#define H_PRIME 1000003
#define HASH_LEN 11
// my_m128_t mimics Intel __m128i type
typedef struct my_m128_t my_m128_t;
struct my_m128_t { short m128i_i16[8]; };
// Simulate python modulo operator. Assumes mod is positive.
static int py_mod(int j, int mod)
{
int ret;
if (j >= 0) {
ret = j % mod;
} else {
ret = j - (j / mod) * mod;
if (ret) ret += mod;
}
return ret;
}
// Apart from:
// -1 % 1001 == 1000
// -2 % 1002 == 1000
// both -1 and -2 can never match any of 1000, 500, ..., 1
// Thus don't worry about the Python hash conversion from -1 to -2, ex
+cept for M
int inner(
int startval, int endval,
int m2, int d2, int c2, int l2, int x2, int v2, int i2,
my_m128_t* ps)
{
int m, d, c, l, x, v, i;
int m3, m4, m5, m6, m7, m8, m9;
int d3, d4, d5, d6, d7, d8, d9;
int c3, c4, c5, c6, c7, c8, c9;
int l3, l4, l5, l6, l7, l8, l9;
int x3, x4, x5, x6, x7, x8, x9;
int v3, v4, v5, v6, v7, v8, v9;
int i3, i4, i5, i6, i7, i8, i9;
int q3, q4, q5, q6, q7, q8, q9;
int iret = 0;
for (q3 = startval; q3 < endval; ++q3) {
if (q3 == 10 || q3 == 13) continue;
m3 = (m2 ^ q3) * H_PRIME;
d3 = (d2 ^ q3) * H_PRIME;
c3 = (c2 ^ q3) * H_PRIME;
l3 = (l2 ^ q3) * H_PRIME;
x3 = (x2 ^ q3) * H_PRIME;
v3 = (v2 ^ q3) * H_PRIME;
i3 = (i2 ^ q3) * H_PRIME;
for (q4 = 1; q4 < 128; ++q4) {
if (q4 == 10 || q4 == 13) continue;
m4 = (m3 ^ q4) * H_PRIME;
d4 = (d3 ^ q4) * H_PRIME;
c4 = (c3 ^ q4) * H_PRIME;
l4 = (l3 ^ q4) * H_PRIME;
x4 = (x3 ^ q4) * H_PRIME;
v4 = (v3 ^ q4) * H_PRIME;
i4 = (i3 ^ q4) * H_PRIME;
for (q5 = 1; q5 < 128; ++q5) {
if (q5 == 10 || q5 == 13) continue;
m5 = (m4 ^ q5) * H_PRIME;
d5 = (d4 ^ q5) * H_PRIME;
c5 = (c4 ^ q5) * H_PRIME;
l5 = (l4 ^ q5) * H_PRIME;
x5 = (x4 ^ q5) * H_PRIME;
v5 = (v4 ^ q5) * H_PRIME;
i5 = (i4 ^ q5) * H_PRIME;
for (q6 = 1; q6 < 128; ++q6) {
if (q6 == 10 || q6 == 13) continue;
m6 = (m5 ^ q6) * H_PRIME;
d6 = (d5 ^ q6) * H_PRIME;
c6 = (c5 ^ q6) * H_PRIME;
l6 = (l5 ^ q6) * H_PRIME;
x6 = (x5 ^ q6) * H_PRIME;
v6 = (v5 ^ q6) * H_PRIME;
i6 = (i5 ^ q6) * H_PRIME;
for (q7 = 1; q7 < 128; ++q7) {
if (q7 == 10 || q7 == 13) continue;
m7 = (m6 ^ q7) * H_PRIME;
d7 = (d6 ^ q7) * H_PRIME;
c7 = (c6 ^ q7) * H_PRIME;
l7 = (l6 ^ q7) * H_PRIME;
x7 = (x6 ^ q7) * H_PRIME;
v7 = (v6 ^ q7) * H_PRIME;
i7 = (i6 ^ q7) * H_PRIME;
for (q8 = 1; q8 < 128; ++q8) {
if (q8 == 10 || q8 == 13) continue;
m8 = (m7 ^ q8) * H_PRIME;
d8 = (d7 ^ q8) * H_PRIME;
c8 = (c7 ^ q8) * H_PRIME;
l8 = (l7 ^ q8) * H_PRIME;
x8 = (x7 ^ q8) * H_PRIME;
v8 = (v7 ^ q8) * H_PRIME;
i8 = (i7 ^ q8) * H_PRIME;
for (q9 = 1; q9 < 128; ++q9) {
if (q9 == 10 || q9 == 13) continue;
m9 = (m8 ^ q9) ^ HASH_LEN;
if (m9 == -1) m9 = -2;
m = py_mod(m9, 1001);
if (m != 1000) continue;
d9 = (d8 ^ q9) ^ HASH_LEN;
d = py_mod(d9, 1001);
if (d != 500) continue;
c9 = (c8 ^ q9) ^ HASH_LEN;
c = py_mod(c9, 1001);
if (c != 100) continue;
l9 = (l8 ^ q9) ^ HASH_LEN;
l = py_mod(l9, 1001);
if (l != 50) continue;
x9 = (x8 ^ q9) ^ HASH_LEN;
x = py_mod(x9, 1001);
if (x != 10) continue;
v9 = (v8 ^ q9) ^ HASH_LEN;
v = py_mod(v9, 1001);
if (v != 5) continue;
i9 = (i8 ^ q9) ^ HASH_LEN;
i = py_mod(i9, 1001);
if (i != 1) continue;
ps[iret].m128i_i16[0] = q3;
ps[iret].m128i_i16[1] = q4;
ps[iret].m128i_i16[2] = q5;
ps[iret].m128i_i16[3] = q6;
ps[iret].m128i_i16[4] = q7;
ps[iret].m128i_i16[5] = q8;
ps[iret].m128i_i16[6] = q9;
++iret;
}
}
}
}
}
}
}
return iret;
}
This is incredible. You must be impossibly stubborn to go through such a tedious work. :)
-- wazoox
Yes wazoox, that was tedious.
Yet the simple, if tedious, unrolling of the py_hash function sped up the code by a factor of four.
Down to an estimated running time of 62,860 years now.
Still a lot more work to do before we can break out the champagne.
Tactical vs Strategic Optimizations
Rule 1: Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you have proven that's where the bottleneck is.
Rule 2: Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest.
-- Rob Pike
When in doubt, use brute force.
-- Ken Thompson
There are many micro-optimizations we could apply to speed up this inner loop.
For example, we shall see later how to apply
the Intel AVX instructions
to vectorize this code:
m3 = (m2 ^ q3) * H_PRIME;
d3 = (d2 ^ q3) * H_PRIME;
c3 = (c2 ^ q3) * H_PRIME;
l3 = (l2 ^ q3) * H_PRIME;
x3 = (x2 ^ q3) * H_PRIME;
v3 = (v2 ^ q3) * H_PRIME;
i3 = (i2 ^ q3) * H_PRIME;
Though AVX does indeed give a minor speed boost --
as do many other tweaks yet to be discussed --
to achieve an order of magnitude performance boost
we need to think strategically, look at the bigger picture.
What overall strategies should we apply to speed up this searcher?
Specifically, can we eliminate the innermost loop somehow? How?
Find out in the next exciting episode.
References
Updated 1 May 2014: Updated find1.cpp main() to take command-line arguments specifying a range of values to search for.
Updated 31 May 2014: added Grimy's one stroke Perl improvement
plus a Perl version of PHP md5 magic formula.