|Problems? Is your data what you think it is?|
Unlike Ruby and Python, I knew nothing of PHP when this game began. Nothing. Zilch. Nada. And trawling through over 3000 verbose, and often appallingly inconsistent, built-in functions did not make a good first impression. More on this later, but, as a language, PHP is streets behind Perl, Ruby and Python IMHO.
Allow me to state this more plainly: if I had to program in PHP for a living, I would poke my eyes out with a fork. And yet I very much enjoyed playing golf in PHP! Why? Golf, at least for me, is essentially a competition, a highly artificial and stylized skill, more akin to chess than to the craft of writing useful, robust and maintainable computer programs.
Having already golfed this problem in three languages by now, my confidence was rising. I felt surprisingly calm about being a complete PHP ignoramus because I had formed the opinion that far more important than knowing the language is understanding the problem and its algorithms.
Taking the Early Lead in PHP
Tuning my magic formula searcher for PHP brought instant rewards in the form of this straightforward 87 stroker:
This early submission stole the PHP lead from the (presumably Japanese) golfer Hiro Suzuki by two strokes. I was so thrilled at finding my first PHP golfing trick of shortening the leading <?php to <? that I totally missed the routine golfing trick of shortening the trailing ?> to semicolon (;), Eugene's advice of "Can't possibly work, try it anyway" having been temporarily forgotten.
Also of note in this solution is the same exploitation of virtual machine evaluation order that I had used previously in Perl and Ruby, to eliminate the "previous value" variable. To get this to work, I had to use +$n rather than a bald $n. Curiously, this behaviour varied between PHP versions, the + being required only with later PHP versions.
Notice that PHP, like Ruby, but unlike Perl and Python, requires you to deal somehow with the trailing newline annoyance, making magic formulae harder to find, and therefore likely to be longer. So it proved here with the two magic formulae above being a touch longer than the early ones I had found for the other languages. And I found the spaces around the .E. intensely irritating -- though, like Perl, but unlike Ruby and Python, mercifully, I didn't need to quote the E.
Breaking the 80 Barrier
Though delighted to be leading the PHP pack, I was horrified by the length of those two magic formulae. What to do?
It's as easy as 1, 2, 3
It occurred to me that in this PHP solution I could replace those two ugly magic formulae with three prettier ones by fiddling with the expression controlling the while loop like so:
Though this required a complete rewrite of my magic formula searcher, it was worth it because it shaved three strokes. Note that, because the while loop expression now eliminates the trailing newline (newline has an ord of 10, so 1230%ord evaluates to zero and terminates the loop), I no longer need to map the newline in the two magic formulae in the loop body. Note too that the intolerable spaces around the E have been eliminated. Yay! Down to 84 and leading by five now. Despite the overall ugliness of the language, unearthing these tactical tricks was making PHP golfing lots of fun.
Applying some further tricks picked up golfing the problem in the other languages, combined with improved magic formulae, and finally remembering good ol' Eugene to find the trailing ; hack, enabled me to whittle this approach all the way down to 75 strokes:
Variety being the spice of life, notice that I also replaced echo$t; with <?=$t; ... though this did not change the golf score one iota. Leading by 14 now. Where to go from here?
The Magic of a Built-in md5 Function
All my moaning about the inelegance of 3000+ built-in functions ceased the instant I stumbled upon the built-in md5 and sha1 functions. These functions were perfect for magic formulae! Why had I not spotted them sooner?
The md5 function, you see, enjoys a fundamental advantage over ord in that you can use all 256 characters in a magic formula, compared to just ten (0-9) for ord (for example, 68040%ord in the magic formula above). To illustrate, in a six character string, there are just 10**6 combinations available for ord, while there are 256**6=2.8*10**14 combinations available for md5.
There is a catch though. To enjoy all 256 characters you must quote the string, which costs you two strokes. Using Eugene's "can't possibly work, try it anyway" approach, however, revealed that PHP seems to treat characters in the ord range 127-255 as "alphabetic". Anyway, you don't need to quote them, making md5 a certain winner over ord in any magic formula race.
To put all this to the proof, I whittled my 75 stroker by three strokes by replacing ord with md5. Here are some of them:
where XXX above is chr(115).chr(205).chr(69), XXXXX is chr(225).chr(246).chr(180).chr(162).chr(188), and XXXX is chr(174).chr(110).chr(204).chr(142).
The observant reader will have noticed the comical "multiply by one" above, as in "5E$c"*1%4999. Why on Earth would I waste two strokes like that? Well, my testing revealed that PHP interprets "5E2" as 500 (i.e. scientific notation) if followed by a multiplies * operator, but as 5 if followed by a modulo % operator! AFAICT, this surprising behaviour is undocumented, an accident of implementation. In case you're interested, Perl consistently interprets "5E2" as 500, whether followed by multiplies * or modulo %. Ruby always interprets "5E2".to_i() as 5 (integer) and "5E2".to_f() as 500 (floating point). Python, as usual, is the strictest of the gang of four languages, always interpreting float("5E2") as 500 (floating point), yet emitting an "invalid literal for int()" runtime error for int("5E2").
Though I swore at PHP's eccentricity here, loud claps of applause could be heard, as described in the next section, when I desperately needed PHP to interpret the "04e9d..." md5 string as four, and not as 4,000,000,000.
Down to 72 strokes now and leading by 17. To go lower, I needed to eliminate the "E" exponentiation. But how?
Alone among the four languages, PHP lacks a ** exponentiation operator. This had proved to be only a minor nuisance so far, forcing me to rely instead on .E. like constructs. Yet with the PHP md5 function, and again alone among the four languages, I finally spied an opportunity to get rid of the dratted stroke-consuming exponentiation. All I needed to do was find a lucky md5 direct mapping of M -> 1000, D -> 500, C -> 100, L -> 50, X -> 10, V -> 5, I -> 1. Alas, back of the envelope calculations indicated that this mapping was infeasible, even with md5. To find such a mapping, I estimated the search program would need to run for hundreds of years!
What to do? Well, remembering the M999D499C99L49X9V4I string used in some of my early Perl table lookup solutions, the M -> 999, D -> 499, C -> 99, L -> 49, X -> 9, V -> 4, I -> 0 mapping looked much more promising because the C and X mappings are reduced by one character in length, plus any of abcdef will now produce the required I -> 0 mapping. The running time of this new and improved mapping was estimated to be less than five years. Getting closer now. Just need to speed up the searcher a bit.
Lacking a super computer, I resorted to plugging an assembly language md5 routine from OpenSSL into my C magic formula searcher and let 'er rip. After dutifully chugging away for six months, the searcher finally unearthed a lucky hit, illustrated by the following test program:
Running the above test program produces:
Notice that applying %1858 to 453851 produces the required 499 D mapping, while leaving the other mappings untouched, thus producing the desired mapping M -> 999, D -> 499, C -> 99, L -> 49, X -> 9, V -> 4, I -> 0, and a corresponding 70 stroke solution:
where XXXXXX above is chr(111).chr(178).chr(219).chr(246).chr(172).chr(209). Success at last!
Proving That a Shorter Solution Exists
Though I can't currently demonstrate a shorter PHP solution, I can easily prove that one exists. First, note that there are about 180 unquoted characters that may be combined to form input to the md5 function, namely ord 127-255, a-z, A-Z, _, and 0-9. Now, suppose you could find an exact hit, making the %1858 above unnecessary. That would save five strokes. To have a good chance of finding such a lucky hit, you'd need to unearth around 10000 solutions like the one found in six months above; with that many solutions, there's an excellent chance of one of them happening to have the required 499 D mapping (rather than 453851 as above). The ballpark figure of 10000 is got by estimating the probability of a random md5 signature starting with 499[a-f], which is (1/16)*(1/16)*(1/16)*(6/16)=1/10923. Now, extending the search space from six characters in length to eight increases the number of possible solutions by about 180*180=32400, which should be more than enough solutions to produce one lucky hit. That is, increasing the magic formula string by two strokes saves five. QED.
To have a reasonable chance of finding such a solution with my current search program, however, requires a running time of six months times 10000 = 5000 years! Not even my new low cholesterol diet will allow me to live that long. Because the search is highly parallelizable, however, a cheap super computer may soon be within reach of solving this problem. With the recent development of CUDA and OpenCL, a humble PC containing six or so high-end NVIDIA graphics cards may well be able to solve it today. To quote Ton Hospel: it has to be tried, at least.
I'm exhausted now after writing all this down, and it's long enough already, so I'll sign off for now. In the final installment, I'll give my opinions on how the four languages stacked up against each other in this challenge and provide some statistics on how the languages have compared to each other at the codegolf web site, in terms both of code length and popularity.
Updated 7-may: Minor clarifications to probabilities in "Proving That a Shorter Solution Exists" section.