Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

The 10**21 Problem (Part I)

by eyepopslikeamosquito (Archbishop)
on Apr 21, 2014 at 17:01 UTC ( [id://1083046]=perlmeditation: print w/replies, xml ) Need Help??

When I looked for magic formulae, etc., I would give up after 20 minutes!

-- Jasper

Unlike Jasper, I have spent more than 20 minutes searching for magic formulae. Especially ancient Roman ones. I first became aware of them back in 2006 during the Fonality Christmas Golf Challenge where I was astonished by Ton's ingenuity and deviousness in constructing his original HART (Hospelian Arabic to Roman Transform) magic formula.

Shortly after the Fonality game, codegolf.com hosted an endless competition where you must convert the other way, from Roman to Decimal. By 2009, I had managed to take the lead in all four languages (Perl, Python, Ruby, PHP), as detailed in this series of nodes. And the winning solutions -- in all four languages -- all used (quite different) magic formulae!

To refresh your memory, the codegolf game rules were essentially:

Convert a Roman numeral to its integer value. The input numeral, provided on stdin followed by a single newline, is guaranteed to be in the range I to MMMCMXCIX (1 to 3999). For example, given IV on stdin, your program should print 4 to stdout.

As you might expect, a key component of solutions to this game is mapping individual Roman letters to their decimal equivalents. To help us focus on that, let's define a simpler spec:

Write a function to convert a single Roman Numeral letter to its decimal equivalent. The function assumes the Roman letter is in $_ and returns its decimal equivalent.

To clarify, here's a sample (non-golfed) Perl solution:

sub r { my %h = ( I=>1, V=>5, X=>10, L=>50, C=>100, D=>500, M=>1000 ); return $h{$_}; } print "$_: ", 0+r(), "\n" for (qw(I V X L C D M));
Running this program produces:
I: 1 V: 5 X: 10 L: 50 C: 100 D: 500 M: 1000

Lookup Table vs Magic Formula

In code golf, there is often a battle between a lookup table and a magic formula. So it proved here. When converting from Roman to Decimal, magic formula trumps lookup table.

Here are three attempts to solve this problem using a lookup table:

sub r {%h=(I,1,V,5,X,10,L,50,C,100,D,500,M,1000);$h{$_}} sub r {M1000D500C100L50X10V5I1=~$_;$'} sub r {M999D499C99L49X9V4I=~$_+$'}

And here are some of my earliest magic formulae from 2007:

sub r {5**y/VLD//.E.(3*/M/+2*/C|D/+/X|L/)} sub r {1 .E.~-ord()*41%52%5>>y/VLD//} sub r {5**y/VLD//.E.ord()x3%75%50%4} sub r {1 .E.(72^ord)*5/7%5>>y/VLD//} sub r {5**y/VLD//.E.(42^88*ord)%5} sub r {1 .E.(3^77%ord)%7>>y/VLD//}
Back then, my magic formula searching skills were so poor that I erroneously concluded that the lookup table approach was better. :-)

Later, when I tackled this problem in Python, I really needed to use each Roman letter once only in the formula, which forced me to explore alternative approaches ... which, in turn, led to still shorter Perl magic formulae, such as:

sub r {10**(7&69303333/ord)%9995} sub r {10**(7&5045e8/ord)%2857} # needs 64-bit Perl sub r {IXCMVLD=~$_;"1E@-"%9995} sub r {XCMVLD=~$_;"1E@+"%9995} # Grimy improvement
Back in the good old days, the little search programs that uncovered these solutions took hours to run, not years. :-)

The long numbers (such as 69303333 above) that began popping up in these formulae were an indication that the ord function didn't scale very well as the required solutions became less probable. Can we find a built-in function better suited to Roman magic formulae? In PHP and Python, yes. In Perl and Ruby, probably not. At least, I don't know of one.

The PHP built-in md5 function is perfect for Roman magic formulae. Not only does it generate all required digits (0-9), it generates little else (just a-f). Moreover, you can use all 256 characters in a magic formula, compared to just ten (0-9) for ord. To illustrate, in an eight character magic string (for example 69303333), there are just 10**8 combinations available for ord, while there are a mind-boggling 256**8=1.8*10**19 combinations available for md5! This is a huge advantage when searching for highly improbable solutions, as we shall see later.

The Python hash function is also superior to ord, though not as good as PHP's md5. This is because Python's Unicode and character escaping claptrap limits you to 125 characters (namely ord 1..9,11,12,14..127) that can be used as input to the hash function without special treatment. Still, 125 is a huge improvement over 10! One drawback of hash compared to md5 is that it generates huge numbers, forcing you to waste five strokes with %1001 to trim the generated numbers into the required Roman Numeral range (1-1000).

In some cases, the Perl crypt built-in could be employed, though it is not well-suited to this specific problem because (unlike md5) it generates many other characters in addition to the desired 0-9.

To recap, the shortest magic formulae found so far in all four languages (as detailed in this series of nodes) are:

10**(205558%ord(r)%7)%9995 # Python hash(r+"magicstrng")%1001 # Python (finding magicstrng is the subjec +t of this node!) VLD=~$_*5+IXCM=~$_."E@-" # Perl 10**(7&5045e8/ord)%2857 # Perl (64-bit) IXCMVLD=~$_;"1E@-"%9995 # Perl XCMVLD=~$_;"1E@+"%9995 # Perl (update: Grimy improvement) uppp&md5_hex$_.PQcUv # Perl (needs Digest::MD5 module) 10**(494254%r/9)%4999 # Ruby (no need for explicit ord in this g +ame) md5($r.magicstrng) # PHP (finding magicstrng is an unsolved p +roblem) md5($r.PQcUv)&uppp # PHP wins due to superb mf properties of +md5

The 10**21 Problem

There's just no sane regularity in this thing. But in "random" mappings with a very small result set like this, the shortest solution is often to make up some magic formula that has no particular meaning, but just happens to give the wanted result.

-- Ton Hospel explaining his original decimal-to-roman magic formula

When Ton Hospel invented magic formulae in golf way back in 2004, he correctly noted that they work best "with a very small result set". Indeed, if there were only five Roman Numeral letters, rather than seven, we would have a straightforward 10**15 problem, rather than a (borderline intractable) 10**21 problem. Why 10**21?

Suppose you have a magic formula that produces a random number between 1 and 1000. The probability of scoring a (lucky) hit for one numeral is therefore 1 in 1000. Since probabilities multiply, the probability of hitting five out of five numerals is 1 in 10**15, while the chance of hitting all seven Roman Numerals is a daunting 1 in 10**21. Though 1 in 10**21 sounds improbable in the extreme, if you could generate 10**21 combinations, you would not need luck, indeed you would expect to find such an improbable solution. Yet how long would it take to search 10**21 combinations for the elusive (lucky) one?

Well, if you could perform 10,000 search operations per second, the time required to search 10**21 combinations is a mere 10**21/10000 seconds = 10**17 seconds = 3,170,979,198 years! By the way, this "brute force search infeasibility problem" is why you are asked to create longer and longer passwords, and with a wider range of characters in them, as computer speeds improve. Indeed, Password cracking and magic formula searching are closely related disciplines, the lack of a "codegolf magic formula" wikipedia page notwithstanding. :-)

To make this theoretical argument more concrete, consider searching for a Roman to Decimal magic formula using the Python built-in hash function. Since this hash function produces very large values, we need to apply %1001 to it so as to produce an essentially random value in the desired 1..1000 range. We might try searching for such a magic formula using a simple Python brute-force search program, such as:

import time print time.time(), time.clock() for q0 in range(1, 128): for q1 in range(1, 128): for q2 in range(1, 128): for q3 in range(1, 128): for q4 in range(1, 128): for q5 in range(1, 128): for q6 in range(1, 128): for q7 in range(1, 128): for q8 in range(1, 128): for q9 in range(1, 128): magic = chr(q0)+chr(q1)+chr(q2)+chr(q3)+chr(q4)+chr(q5)+chr( +q6)+chr(q7)+chr(q8)+chr(q9) m = hash("M" + magic) % 1001 if m != 1000: continue d = hash("D" + magic) % 1001 if d != 500: continue c = hash("C" + magic) % 1001 if c != 100: continue l = hash("L" + magic) % 1001 if l != 50: continue x = hash("X" + magic) % 1001 if x != 10: continue v = hash("V" + magic) % 1001 if v != 5: continue i = hash("I" + magic) % 1001 if i != 1: continue print "bingo!", q0, q1, q2, q3, q4, q5, q6, q7, q8, q9 print time.time(), time.clock()
On my machine, this naive brute force search for a ten-character magic string is expected to find a solution in about 52,887,477 years! Note that, with 127 different values for each character in the string, we need a 10-character magic string to give us the required 127**10 = 1.1e21 = 10**21 combinations.

Rather than give up at this point, I found this "impossible" 50 million year challenge strangely irresistible ... and was eager to learn about any high performance computing techniques required to solve it.

I get it -- optimization is a fun game ... one can play all day with unrolling loops, peeling away layers of indirection, and so forth to gain cycles, while piddling away time and energy

-- davido

Or, as davido puts it, "optimization is a fun game". Well, I enjoy it. So I started by rewriting the above search program in C++, then kept on refining it until I was able to reduce the running time from fifty million years down to only several years and eventually find a solution. This new series of articles describes that endeavour.

Please note that this series focuses more on High Performance Computing/Supercomputing/Parallel computing in C/C++/Intel assembler than code golf.

First Attempt

Re-writing our crude Python searcher in C/C++ we get:

#include <stdio.h> #include <time.h> // Python hash() function. Assumes 32-bit int. // Derived from string_hash() in stringobject.c in Python 2.5.1 source +s. static int py_hash(const char* a, int size) { int len = size; const unsigned char* p = (unsigned char*)a; int x = *p << 7; while (--len >= 0) x = (1000003 * x) ^ *p++; x ^= size; if (x == -1) x = -2; return x; } // Simulate python modulo operator. Assumes mod is positive. static int py_mod(int j, int mod) { int ret; if (j >= 0) { ret = j % mod; } else { ret = j - (j / mod) * mod; if (ret) ret += mod; } return ret; } int main() { int q0, q1, q2, q3, q4, q5, q6, q7, q8, q9; int m, d, c, l, x, v, i; char magic[16]; time_t tstart = time(NULL); clock_t cstart = clock(); time_t tend; clock_t cend; if (sizeof(int) != 4) { fprintf(stderr, "oops sizeof int != 4\n"); +return 1; } for (q0 = 1; q0 < 128; ++q0) { magic[1] = q0; for (q1 = 1; q1 < 128; ++q1) { magic[2] = q1; for (q2 = 1; q2 < 128; ++q2) { magic[3] = q2; for (q3 = 1; q3 < 128; ++q3) { magic[4] = q3; for (q4 = 1; q4 < 128; ++q4) { magic[5] = q4; for (q5 = 1; q5 < 128; ++q5) { magic[6] = q5; for (q6 = 1; q6 < 128; ++q6) { magic[7] = q6; for (q7 = 1; q7 < 128; ++q7) { magic[8] = q7; for (q8 = 1; q8 < 128; ++q8) { magic[9] = q8; for (q9 = 1; q9 < 128; ++q9) { magic[10] = q9; magic[0] = 'M'; m = py_mod(py_hash(magic, 11), 1001); if (m != 1000) continue; magic[0] = 'D'; d = py_mod(py_hash(magic, 11), 1001); if (d != 500) continue; magic[0] = 'C'; c = py_mod(py_hash(magic, 11), 1001); if (c != 100) continue; magic[0] = 'L'; l = py_mod(py_hash(magic, 11), 1001); if (l != 50) continue; magic[0] = 'X'; x = py_mod(py_hash(magic, 11), 1001); if (x != 10) continue; magic[0] = 'V'; v = py_mod(py_hash(magic, 11), 1001); if (v != 5) continue; magic[0] = 'I'; i = py_mod(py_hash(magic, 11), 1001); if (i != 1) continue; printf("bingo! %d %d %d %d %d %d %d %d %d %d\n", q0, q1, q2, q3, q4, q5, q6, q7, q8, q9); } } } } } } } } } } tend = time(NULL); cend = clock(); printf("(wall clock time:%ld secs, cpu time:%.2f units)\n", (long) (difftime(tend, tstart)+0.5), (double) (cend-cstart) / (double)CLOCKS_PER_SEC); return 0; }
which ran about 190 times faster than the Python version. Down to a running time of 275,530 years now. Woohoo!

Unroll the hash and reorg the code

An obvious optimization is to "unroll" the py_hash() function within the loop; that is, incrementally build the hash value within the loop and remove the py_hash() function.

Note, by the way, that this algorithm is very well-suited to multi-threading (embarrassingly parallel); each thread can be given a different starting and ending range to search and just go at it; there is no need for any thread synchronization at all. So I took the opportunity at this stage to reorganize the code, to factor out the "inner loop" into its own file with a single function (with a C call interface) that just contains raw calculation, no global or static variables, no library calls, no I/O. Just code. This reorganization has a number of benefits:

  • Easy to make the inner loop code thread-safe.
  • Easy to call from multiple threads (having the "inner loop" function do no I/O simplifies this).
  • Gives you the option to rewrite the "inner loop" file (in assembler, say) so long as you honor its interface.
  • Easier to port. The code that launches threads will likely be different on each platform (e.g. POSIX vs Win32 threads), the I/O code may be different on each platform, while the inner loop code can be the same on all platforms (or tailored to suit).
  • From now on, I can focus on the crucial inner loop code, and so have less boilerplate C++ code cluttering this meditation. :-)
Of course, I didn't want function call overhead to inhibit this reorganization, so I arbitrarily made the new inner loop function perform the innermost seven loops.

After the code reorganization and hash function unrolling, here is the new main program, find1.cpp:

#include <stdio.h> #include <string.h> #include <stdlib.h> #include <time.h> #define H_PRIME 1000003 // Most unlikely to get this many hits in a single call of the inner l +oop function. #define MAX_HITS 64 // my_m128_t mimics Intel __m128i type (which will be used later) typedef struct my_m128_t my_m128_t; struct my_m128_t { short m128i_i16[8]; }; // the inner loop function (lives in a separate file) extern "C" int inner( int startval, int endval, int m2, int d2, int c2, int l2, int x2, int v2, int i2, my_m128_t* ps ); void do_one(int q0, int sv1, int ev1, int sv2, int ev2, int sv3, int e +v3) { time_t tstart = time(NULL); clock_t cstart = clock(); time_t tend; clock_t cend; my_m128_t soln[MAX_HITS]; int m1, m2; int m_char = 'M'; int m0 = ( ( H_PRIME * (m_char << 7) ) ^ m_char ) * H_PRIME; int d1, d2; int d_char = 'D'; int d0 = ( ( H_PRIME * (d_char << 7) ) ^ d_char ) * H_PRIME; int c1, c2; int c_char = 'C'; int c0 = ( ( H_PRIME * (c_char << 7) ) ^ c_char ) * H_PRIME; int l1, l2; int l_char = 'L'; int l0 = ( ( H_PRIME * (l_char << 7) ) ^ l_char ) * H_PRIME; int x1, x2; int x_char = 'X'; int x0 = ( ( H_PRIME * (x_char << 7) ) ^ x_char ) * H_PRIME; int v1, v2; int v_char = 'V'; int v0 = ( ( H_PRIME * (v_char << 7) ) ^ v_char ) * H_PRIME; int i1, i2; int i_char = 'I'; int i0 = ( ( H_PRIME * (i_char << 7) ) ^ i_char ) * H_PRIME; int q1, q2; int mm0 = (m0 ^ q0) * H_PRIME; int dd0 = (d0 ^ q0) * H_PRIME; int cc0 = (c0 ^ q0) * H_PRIME; int ll0 = (l0 ^ q0) * H_PRIME; int xx0 = (x0 ^ q0) * H_PRIME; int vv0 = (v0 ^ q0) * H_PRIME; int ii0 = (i0 ^ q0) * H_PRIME; fprintf(stderr, "%d: sv1=%d ev1=%d sv2=%d ev2=%d sv3=%d ev3=%d:\n", q0, sv1, ev1, sv2, ev2, sv3, ev3); // We ignore 0, 10, 13 because these three require escaping in Pytho +n. for (q1 = sv1; q1 < ev1; ++q1) { if (q1 == 10 || q1 == 13) continue; m1 = (mm0 ^ q1) * H_PRIME; d1 = (dd0 ^ q1) * H_PRIME; c1 = (cc0 ^ q1) * H_PRIME; l1 = (ll0 ^ q1) * H_PRIME; x1 = (xx0 ^ q1) * H_PRIME; v1 = (vv0 ^ q1) * H_PRIME; i1 = (ii0 ^ q1) * H_PRIME; for (q2 = sv2; q2 < ev2; ++q2) { if (q2 == 10 || q2 == 13) continue; m2 = (m1 ^ q2) * H_PRIME; d2 = (d1 ^ q2) * H_PRIME; c2 = (c1 ^ q2) * H_PRIME; l2 = (l1 ^ q2) * H_PRIME; x2 = (x1 ^ q2) * H_PRIME; v2 = (v1 ^ q2) * H_PRIME; i2 = (i1 ^ q2) * H_PRIME; int isoln = inner(sv3, ev3, m2, d2, c2, l2, x2, v2, i2, soln); if (isoln > 0) { int k; for (k = 0; k < isoln; ++k) { fprintf(stderr, "%d %d %d %hd %hd %hd %hd %hd %hd %hd\n", q0, q1, q2, soln[k].m128i_i16[0], soln[k].m128i_i16[1], soln[k].m128i_i16[2], soln[k].m128i_i1 +6[3], soln[k].m128i_i16[4], soln[k].m128i_i16[5], soln[k].m128i_i1 +6[6]); } } } } tend = time(NULL); cend = clock(); fprintf(stderr, "(wall clock time:%ld secs, cpu time:%.2f units)\n" +, (long) (difftime(tend, tstart)+0.5), (double) (cend-cstart) / (double)CLOCKS_PER_SEC); } int main(int argc, char* argv[]) { int sv0, sv1, sv2, sv3, ev1, ev2, ev3; if (argc != 8) { fprintf(stderr, "usage: prog sv0 sv1 ev1 sv2 ev2 s +v3 ev3\n"); return 1; } sv0 = atoi(argv[1]); sv1 = atoi(argv[2]); ev1 = atoi(argv[3]); sv2 = atoi(argv[4]); ev2 = atoi(argv[5]); sv3 = atoi(argv[6]); ev3 = atoi(argv[7]); do_one(sv0, sv1, ev1, sv2, ev2, sv3, ev3); return 0; }

And here is the "inner loop" function, in its own file, inner1.c:

#define H_PRIME 1000003 #define HASH_LEN 11 // my_m128_t mimics Intel __m128i type typedef struct my_m128_t my_m128_t; struct my_m128_t { short m128i_i16[8]; }; // Simulate python modulo operator. Assumes mod is positive. static int py_mod(int j, int mod) { int ret; if (j >= 0) { ret = j % mod; } else { ret = j - (j / mod) * mod; if (ret) ret += mod; } return ret; } // Apart from: // -1 % 1001 == 1000 // -2 % 1002 == 1000 // both -1 and -2 can never match any of 1000, 500, ..., 1 // Thus don't worry about the Python hash conversion from -1 to -2, ex +cept for M int inner( int startval, int endval, int m2, int d2, int c2, int l2, int x2, int v2, int i2, my_m128_t* ps) { int m, d, c, l, x, v, i; int m3, m4, m5, m6, m7, m8, m9; int d3, d4, d5, d6, d7, d8, d9; int c3, c4, c5, c6, c7, c8, c9; int l3, l4, l5, l6, l7, l8, l9; int x3, x4, x5, x6, x7, x8, x9; int v3, v4, v5, v6, v7, v8, v9; int i3, i4, i5, i6, i7, i8, i9; int q3, q4, q5, q6, q7, q8, q9; int iret = 0; for (q3 = startval; q3 < endval; ++q3) { if (q3 == 10 || q3 == 13) continue; m3 = (m2 ^ q3) * H_PRIME; d3 = (d2 ^ q3) * H_PRIME; c3 = (c2 ^ q3) * H_PRIME; l3 = (l2 ^ q3) * H_PRIME; x3 = (x2 ^ q3) * H_PRIME; v3 = (v2 ^ q3) * H_PRIME; i3 = (i2 ^ q3) * H_PRIME; for (q4 = 1; q4 < 128; ++q4) { if (q4 == 10 || q4 == 13) continue; m4 = (m3 ^ q4) * H_PRIME; d4 = (d3 ^ q4) * H_PRIME; c4 = (c3 ^ q4) * H_PRIME; l4 = (l3 ^ q4) * H_PRIME; x4 = (x3 ^ q4) * H_PRIME; v4 = (v3 ^ q4) * H_PRIME; i4 = (i3 ^ q4) * H_PRIME; for (q5 = 1; q5 < 128; ++q5) { if (q5 == 10 || q5 == 13) continue; m5 = (m4 ^ q5) * H_PRIME; d5 = (d4 ^ q5) * H_PRIME; c5 = (c4 ^ q5) * H_PRIME; l5 = (l4 ^ q5) * H_PRIME; x5 = (x4 ^ q5) * H_PRIME; v5 = (v4 ^ q5) * H_PRIME; i5 = (i4 ^ q5) * H_PRIME; for (q6 = 1; q6 < 128; ++q6) { if (q6 == 10 || q6 == 13) continue; m6 = (m5 ^ q6) * H_PRIME; d6 = (d5 ^ q6) * H_PRIME; c6 = (c5 ^ q6) * H_PRIME; l6 = (l5 ^ q6) * H_PRIME; x6 = (x5 ^ q6) * H_PRIME; v6 = (v5 ^ q6) * H_PRIME; i6 = (i5 ^ q6) * H_PRIME; for (q7 = 1; q7 < 128; ++q7) { if (q7 == 10 || q7 == 13) continue; m7 = (m6 ^ q7) * H_PRIME; d7 = (d6 ^ q7) * H_PRIME; c7 = (c6 ^ q7) * H_PRIME; l7 = (l6 ^ q7) * H_PRIME; x7 = (x6 ^ q7) * H_PRIME; v7 = (v6 ^ q7) * H_PRIME; i7 = (i6 ^ q7) * H_PRIME; for (q8 = 1; q8 < 128; ++q8) { if (q8 == 10 || q8 == 13) continue; m8 = (m7 ^ q8) * H_PRIME; d8 = (d7 ^ q8) * H_PRIME; c8 = (c7 ^ q8) * H_PRIME; l8 = (l7 ^ q8) * H_PRIME; x8 = (x7 ^ q8) * H_PRIME; v8 = (v7 ^ q8) * H_PRIME; i8 = (i7 ^ q8) * H_PRIME; for (q9 = 1; q9 < 128; ++q9) { if (q9 == 10 || q9 == 13) continue; m9 = (m8 ^ q9) ^ HASH_LEN; if (m9 == -1) m9 = -2; m = py_mod(m9, 1001); if (m != 1000) continue; d9 = (d8 ^ q9) ^ HASH_LEN; d = py_mod(d9, 1001); if (d != 500) continue; c9 = (c8 ^ q9) ^ HASH_LEN; c = py_mod(c9, 1001); if (c != 100) continue; l9 = (l8 ^ q9) ^ HASH_LEN; l = py_mod(l9, 1001); if (l != 50) continue; x9 = (x8 ^ q9) ^ HASH_LEN; x = py_mod(x9, 1001); if (x != 10) continue; v9 = (v8 ^ q9) ^ HASH_LEN; v = py_mod(v9, 1001); if (v != 5) continue; i9 = (i8 ^ q9) ^ HASH_LEN; i = py_mod(i9, 1001); if (i != 1) continue; ps[iret].m128i_i16[0] = q3; ps[iret].m128i_i16[1] = q4; ps[iret].m128i_i16[2] = q5; ps[iret].m128i_i16[3] = q6; ps[iret].m128i_i16[4] = q7; ps[iret].m128i_i16[5] = q8; ps[iret].m128i_i16[6] = q9; ++iret; } } } } } } } return iret; }

This is incredible. You must be impossibly stubborn to go through such a tedious work. :)

-- wazoox

Yes wazoox, that was tedious. Yet the simple, if tedious, unrolling of the py_hash function sped up the code by a factor of four. Down to an estimated running time of 62,860 years now. Still a lot more work to do before we can break out the champagne.

Tactical vs Strategic Optimizations

Rule 1: Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you have proven that's where the bottleneck is.
Rule 2: Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest.

-- Rob Pike

When in doubt, use brute force.

-- Ken Thompson

There are many micro-optimizations we could apply to speed up this inner loop. For example, we shall see later how to apply the Intel AVX instructions to vectorize this code:

m3 = (m2 ^ q3) * H_PRIME; d3 = (d2 ^ q3) * H_PRIME; c3 = (c2 ^ q3) * H_PRIME; l3 = (l2 ^ q3) * H_PRIME; x3 = (x2 ^ q3) * H_PRIME; v3 = (v2 ^ q3) * H_PRIME; i3 = (i2 ^ q3) * H_PRIME;
Though AVX does indeed give a minor speed boost -- as do many other tweaks yet to be discussed -- to achieve an order of magnitude performance boost we need to think strategically, look at the bigger picture. What overall strategies should we apply to speed up this searcher? Specifically, can we eliminate the innermost loop somehow? How? Find out in the next exciting episode.

References

References Added Later

Updated 1 May 2014: Updated find1.cpp main() to take command-line arguments specifying a range of values to search for. Updated 31 May 2014: added Grimy's one stroke Perl improvement plus a Perl version of PHP md5 magic formula.

Replies are listed 'Best First'.
Re: The 10**21 Problem (Part I)
by ambrus (Abbot) on Apr 24, 2014 at 18:24 UTC

    Let me guess about the next optimizations you have made after this?

    Firstly, bitwise xor is associative, so there's no point in computing the ^ HASH_LEN in the innermost loop, you can put that to m8..i8 instead of m9..i9. Eg.

    m8 = (m7 ^ q8) * H_PRIME ^ HASH_LEN; m9 = m8 ^ q9;
    Though the compiler might already be smart enough to notice this, so it might not have any effect.

    Secondly, I wouldn't bother computing c8..i8 unless m9 is correct.

    Thirdly, there is no point looping over q9. Instead, notice that as q9 takes all possible values from 0..127, m9 will take the values from (m8&~127)..(m8&~127)+127 in a different order. Among those 127 possible values of m9, there can only be one that has the right value modulo 1001. You can compute that value directly, and try the corresponding single value for q9, checking if it lies in the range.

    However, I wonder, does the magic formula really have to use exactly 1001 as the modulus? I think it could use any modulus between 1001 and 9999.

    To search for the magic formula in that case, first compute t = gcd(m9 - 1000, d9 - 500) (be careful with underflows). Then compute the small prime factors of t, and using this, all divisors of t between 1001 and 9999, and check all those divisors as the modulus against the rest of the numbers. In the rare lucky case that t is 0, you will have to try all moduluses (or all divisors of c9 - 100).

    Most of the time, t should be between 1 and 1000, in case you don't have to compute a factorization because no modulus can work. Because of this, you should probably not bother to compute c9..i9 and c8..i8 unless t is greater than 1000 or zero.

    This would certainly be more difficult to implement, and might take more time for each individual magic string, but it also has a higher chance of finding a formula, so it might be worth.

    Update: fixed a typo from (m8&!127)+127 to (m8&~127)+127 that eyepopslikeamosquito noticed.

Re: The 10**21 Problem (Part I)
by davido (Cardinal) on Apr 21, 2014 at 19:54 UTC

    ++ for the entertaining post, and the well drawn conclusions.


    Dave

Re: The 10**21 Problem (Part I)
by oiskuu (Hermit) on May 05, 2014 at 20:08 UTC

    Regarding the hash function and the modulus.

    Firstly, the if (m9 == -1) continue; clause could just as well be omitted—this may only result in false positives, not a false negative. Unlikely to matter though. I would also test if skipping the 10/13 cmps in the innermost loop(s) might improve performance!

    Apart from the oddball -1 case, the py_hash() function does not feed high order bits back into the register. Bits propagate higher; low bits of the hash value depend on low bits of input only. The python modulus is trickier. It ((1001+v%1001) % 1001) can be evaluated via unsigned modulus: ((v-(v<0)*620U) % 1001). The sign bit therefore mixes into the 3rd lowest bit.

    Compilers know how to replace a divide-by-constant with multiplication by its reciprocal. (Div operation typically has a very high latency.) A division by 10, for example, may be substituted with mul 0xcccccccd; shift right by 3(+32). The shift size varies:

    udiv by 9993: magic=2746836385 (0xa3b965a1); shift=14; udiv by 8089: magic=2174828291 (0x81a13f03); shift=12 udiv by 1001: magic=98685563 (0x05e1d27b); shift=10; udiv by 1419: magic=24214051 (0x01717a23); shift=3 udiv by 6019: magic=2854273 (0x002b8d81); shift=2 udiv by 4765: magic=3605429 (0x003703b5); shift=2 udiv by 6147: magic=1397419 (0x001552ab); shift=1 udiv by 2049: magic=4192257 (0x003ff801); shift=1
    Evidently, the choice of modulus can strengthen the hash function. Mod=2049 is particularly weak: (-1U%2049 +1) == 1024. Sign mixes into 10th lowest.

    Update: it is easily shown that modulus must be odd. Hash function either preserves or inverts the lowest bit of r. Even modulus would also preserve it, making (r^v)&1 constant. But that does not fit the problem.

Re: The 10**21 Problem (Part I)
by Grimy (Pilgrim) on May 30, 2014 at 08:05 UTC
    You can easily gain one character on the Perl one:
    sub u {IXCMVLD=~$_;"1E@-"%9995} sub i {XCMVLD=~$_;"1E@+"%9995} print $_, v9, u, v9, i, $/ for qw(M D C L X V I);
    Output:
    M 1000 1000 D 500 500 C 100 100 L 50 50 X 10 10 V 5 5 I 1 1

      Thanks Grimy.

      By the way, the PHP md5 solution works in Perl too since these two languages have essentially the same bitwise string operators:

      use Digest::MD5 qw(md5_hex); sub r {uppp&md5_hex$_.PQcUv} print "$_: ", 0+r(), "\n" for (qw(I V X L C D M));

      I've updated the Perl summary of shortest solutions in the root node like so:

      VLD=~$_*5+IXCM=~$_."E@-" # Perl 10**(7&5045e8/ord)%2857 # Perl (64-bit) IXCMVLD=~$_;"1E@-"%9995 # Perl XCMVLD=~$_;"1E@+"%9995 # Perl (Grimy improvement) uppp&md5_hex$_.PQcUv # Perl (needs Digest::MD5 module)

      As you can see, if Perl had a md5 built-in function, this would be the shortest Perl solution -- even with a ridiculously long md5_hex function name!

      Update: for full details on how I happened to find this md5 solution, see Re^2: The golf course looks great, my swing feels good, I like my chances (Part III).

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://1083046]
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others perusing the Monastery: (5)
As of 2024-09-18 20:48 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    The PerlMonks site front end has:





    Results (25 votes). Check out past polls.

    Notices?
    erzuuli‥ 🛈The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.