in reply to The golf course looks great, my swing feels good, I like my chances (Part III)

I could see how the hallvabo-provoked improved Python algorithm could improve my PHP score ... if only I could find an improved magic formula, one with a distinct newline mapping. Unfortunately, my original 70 stroke solution mapped newline to zero, which clashes with I -> 0. I'm happy to report that I did manage to find a very lucky 68 stroker, with a newline -> 10 mapping, as described in detail below.

Running the following PHP program:

$m = chr(205).chr(141).chr(207).chr(128).chr(206).chr(142); echo "M " . md5("M".$m) . "\n"; echo "D " . md5("D".$m) . "\n"; echo "C " . md5("C".$m) . "\n"; echo "L " . md5("L".$m) . "\n"; echo "X " . md5("X".$m) . "\n"; echo "V " . md5("V".$m) . "\n"; echo "I " . md5("I".$m) . "\n"; echo "n " . md5("\n".$m) . "\n";
M 999cb80306cbebadbd2d929ba8ed579a D 472969775e593db37081e391c072a85c C 99f4ea53c723dc31fe41bee9bd3f1fcd L 154047519861cb3d7c35ba7d283ce80e X 9fba084f9fdf4d9cd0e480e32b6d26d4 V 004b6563916984013ca50d0ee8add0e7 I daba8b73d682dc51d3c6dfa5ed770c4d n 10a89bcbe90dda3d6bb0e5ee5b8fbe21
Notice that M -> 999, C -> 99, X -> 9, V -> 4 and I -> 0 are all direct hits. And, crucial for the new algorithm, newline has a distinct value (10) thus ensuring it can be distinguished from the roman letters. Now, applying 472969775 % 2119 produces the desired D -> 499 mapping. So far, so good. Just one more miracle is required: a L -> 49 mapping with the same modulo (2119) as the D mapping.

Pop quiz: what should 154047519861 % 2119 produce? Perl, Python and Ruby are unanimous on that score: the correct answer is 157. Surprisingly, however, running the following PHP test program:

$L_num = 154047519861 % 2119; $L_str = "154047519861" % 2119; echo "L_num : $L_num\n"; echo "L_str : $L_str\n";
L_num : -1324 L_str : 49
The curious thing is that both answers are wrong. The miraculous thing is that the string "answer", $L_str above, produces the desired L -> 49 mapping! So I'm grateful for this (presumably 32-bit related) PHP bug because this lucky hit finally allowed me to break the 70 barrier:
where XXXXXX above is chr(205).chr(141).chr(207).chr(128).chr(206).chr(142). 68 strokes!

Though I admit I never expected to break the 70 barrier, I now believe that a score in the low 60s is a certainty. Here's why.

Suppose you could find a direct hit: M -> 1000, D -> 500, C -> 100, L -> 50, X -> 10, V -> 5, I -> 1, newline -> 0. You could then replace the:

claptrap with simply:
a saving of eight strokes. The comical "*1" above is to work around a PHP idiosyncrasy, to ensure that "1e3" is interpreted as 1000 and not 1.

How long would the magic string need to be to find such a solution? The probability of matching three specific characters followed by [a-f] is (1/16)*(1/16)*(1/16)*(6/16) = 1/10923, which is the rough odds of matching M, D and C: "1e3", "500", "5e2", "100", "1e2". Matching 50 and 10 is more likely: (1/16)*(1/16)*(6/16) = 1/689. Matching five and one is more likely still: (1/16) * (6/16) = 1/42. Finally, matching zero is only: 6/16 = 3/8. So the, admittedly very rough, odds of such a miraculously lucky hit is (1/10923)**3 * (1/689)**2 * (1/42)**2 * (3/8), which comes to the order of one in 10**21. Since there are 180 characters available, a ten character magic string can produce 180**10 = 10**22 distinct combinations, making such a lucky hit a virtual certainty. I therefore declare that a 64 stroke solution is certain and shorter ones possible.

This problem, however, is now no longer one of golf, but of computation. Even with super computers, 10**21 combinations is a daunting challenge and I expect you'd need to set up a cooperative grid computing project to find it. Such a project, of course, is far less worthy than SETI@home or Einstein@Home and would be a horrible waste of CPU cycles. :)

Update 2012: it seems you don't need a cooperative grid computing project anymore, you can just rent the power in the cloud for around $2 per hour. See also Cracking Passwords in The Cloud: Amazon's new EC2 GPU Instances.

Update 2014: See also The 10**21 Problem (Part I), an example of evaluating 10**21 combinations in a C program to find a Python magic formula.

Replies are listed 'Best First'.
Re^2: The golf course looks great, my swing feels good, I like my chances (Part III)
by eyepopslikeamosquito (Bishop) on Apr 25, 2010 at 07:03 UTC

    Though I'd long stopped searching for fresh ideas to try in this game, a little while after this ToastyX-provoked epiphany, I could not help but wonder if the string bitwise & PHP golfing trick might somehow be similarly applied to the older Roman to Decimal game.

    I started with an inspection of my previous best PHP solution:

    There's a lot to dislike about this code. Much nicer is:
    While such a solution undoubtedly exists (as argued last time), I don't have a powerful enough computer to find it. The reason such a long (ten character) magic string is likely to be required is because hitting the longer Roman numbers is so improbable; that is, the probability of a random MD5 digest happening to start with 100[a-f], 500[a-f] or 1000[a-f] is too low.

    Can good ol' string bitwise & help here? To find out, I started by running a little Perl program:

    for my $i (48..57,97..102) { printf " %08b (ord %3d %c)\n", $i, $i, $i; }
    to display the bit patterns of all characters that may appear in a MD5 digest:
    00110000 (ord 48, "0") 00110001 (ord 49, "1") 00110010 (ord 50, "2") 00110011 (ord 51, "3") 00110100 (ord 52, "4") 00110101 (ord 53, "5") 00110110 (ord 54, "6") 00110111 (ord 55, "7") 00111000 (ord 56, "8") 00111001 (ord 57, "9") 01100001 (ord 97, "a") 01100010 (ord 98, "b") 01100011 (ord 99, "c") 01100100 (ord 100, "d") 01100101 (ord 101, "e") 01100110 (ord 102, "f")
    Hmmm, though we clearly cannot bitwise & from [a-f] to [0-9] or vice versa, we can bitwise & all digits to zero. For example:
    00111001 (ord 57, "9" character) & 01110000 (ord 112, "p" character) = 00110000 (ord 48, "0" character)
    Does that help? Yes! You see, applying &p to the second, third and fourth characters of the MD5 digest transforms any digit in these character positions to zero, making it much more likely to hit the problematic 100, 500 and 1000 Roman numbers.

    What about the first character? A simple brute force search program showed that the best we can do here is &u, which transforms [0-9] like so:

    0 0 1 1 2 0 3 1 4 4 5 5 6 4 7 5 8 0 9 1
    thus making a starting 1 or 5 more likely.

    Putting these two together, a &uppp bitwise string operation truncates an MD5 digest string to four characters as follows:

    [0145a`de] [0`] [0`] [0`]
    The bitwise &uppp operation therefore reduces the possible numbers in a MD5 digest from millions to just thirteen, namely: 0, 1, 4, 5, 10, 40, 50, 100, 400, 500, 1000, 4000, 5000.

    The next step was to update my old C search program with an adjusted my_md5 function as follows:

    static int my_md5(const unsigned char* inbuf) { int outi = 0; int nib; MD5_CTX mdContext; #ifdef MY_ASM mdContext.A = 0x67452301; mdContext.B = 0xefcdab89; mdContext.C = 0x98badcfe; mdContext.D = 0x10325476; md5_block_asm_data_order(&mdContext, inbuf, 1); #else md5_block_asm_data_order(&mdContext, inbuf); #endif /* 1st byte */ nib = (unsigned char)((mdContext.A >> 4) & 0xF); if (nib > 9) return outi; nib &= 5; // <-- added this line outi = outi * 10 + nib; nib = (unsigned char)(mdContext.A & 0xF); if (nib > 9) return outi; nib = 0; // <-- added this line outi = outi * 10 + nib; /* 2nd byte */ nib = (unsigned char)((mdContext.A >> 12) & 0xF); if (nib > 9) return outi; nib = 0; // <-- added this line outi = outi * 10 + nib; nib = (unsigned char)((mdContext.A >> 8) & 0xF); if (nib > 9) return outi; nib = 0; // <-- added this line outi = outi * 10 + nib; return outi; // <-- only care about first four chars now }
    After adjusting M_TARG and related #defines, running this modified search program unearthed a number of five character magic string hits, but none shorter. From the many different five character solutions, I naturally chose one consisting of all alphabetic characters, namely PQcUv.

    Because PHP and Perl have essentially the same string bitwise operator behavior, I was able to test these ideas in Perl. You can see more clearly how all this works, by running the following Perl test program:

    use Digest::MD5 qw(md5_hex); for my $r (M, D, C, L, X, V, I, "\n") { my $hd = md5_hex( $r . PQcUv ); print "$r $hd ", $hd & uppp, "\n"; }
    which produces:
    M 993332d3c4fa8b3839761ca4dd480f7b 1000 D 503c3b61b971c24100e97c3297882b22 500` C 112da378c970c5a0ff7769acd85095c7 100` L 58dbff2e42141165a1e04bbc629df030 50`` X 90a86e4f0b96ada8fb44bcb43d16f25e 10`0 V 7c9f70b74e8f19325e8dc62d31a9dd87 5`0` I 9bb0bbb1977969abd73e5f8121e6cff7 1``0 c697df2fcf84f10272369ee4482e5c1c a000
    Note that the ` characters above, the result of applying &p to [a-f], terminate the number when it's used in an arithmetic operation. Note too that for PHP we require the newline case to evaluate to zero (as a000 does above) to terminate the controlling while loop (this is not required in Perl because /./g removes the trailing newline).

    Here's an example Perl solution to this game:

    use Digest::MD5 md5_hex; $\+=$n-2*$n%($n=uppp&md5_hex$_.PQcUv)for<>=~/./g;print
    Were md5 a built-in Perl function, this would be a winning Perl golf entry. Note, by the way, that it may be possible to replace md5 with Perl's built-in crypt function and be competitive in this game.

    As you might expect, having a built-in md5 function made this new approach a winner in PHP, as demonstrated by the following 63 stroker:

    Curiously, this turned out to have exactly the same length as the (theoretical) long magic string solution postulated earlier:

    The shortest known PHP solution to this game is thus reduced from 68 to 63 strokes, and without requiring a super computer.

    A general golfing lesson to take from this little episode is that if you get stuck on one golf game, try another. You may be able to apply tricks learnt in the new game to the one you were previously stuck on.