http://www.perlmonks.org?node_id=759963

🏌

Competing in a modern codegolf tournament can be a lonely experience. You play alone. You analyze alone. After an occasional breakthrough, you celebrate alone, your hard-won solution a closely guarded secret. There is no nineteenth hole, no place to share your solutions with your buddies over a beer.

Accordingly, and given the codegolf Roman-to-Decimal course has been open for more than two years now, I've decided to share my solutions to that game here. You'll get to look over my shoulder as I describe the circumstances and thought processes behind these solutions. I hope you'll find my journey through this competition both interesting and instructive, especially so for the serious, aspiring golfer.

Rules of the Game

In this game, you must convert a Roman numeral to its integer value. The input numeral, provided on stdin followed by a single newline, is guaranteed to be in the range I to MMMCMXCIX (1 to 3999). For example, given IV on stdin, your program should print 4 to stdout.

Test Driven Golf

I always like to start a golf game by writing a comprehensive test program. That saves time in the long run, enabling you to easily verify the many and varied solutions you must concoct to be competitive at golf. So, just like any journeyman test-first developer, I began this game by writing such a test program. A little searching on the CPAN quickly uncovered the Roman module, which I used to write the following test program:

use strict; use warnings; use Roman; my $testprog = shift or die "usage: $0 program-file\n"; my $datafile = 'tt.txt'; my $cmd = "perl $testprog <$datafile"; sub build_file { my $contents = shift; open(my $fh, '>', $datafile) or die "open '$datafile': $!"; print $fh "$contents\n"; close($fh); } print "Testing $testprog, size=", -s $testprog, " bytes.\n"; for my $i (1..3999) { my $r = uc(roman($i)); print "i=$i r=$r\n"; build_file($r); my $got = `$cmd`; chomp($got); $got eq $i or die "expected '$i' got '$got'\n"; } print "successful\n";

Since this tests all valid input values (1..3999), any solution that passes this test program should be correct. Unfortunately, there are usually holes (no pun intended) in the test programs used by the codegolf web site because, in their fully automated flavour of golf, the test program must complete in less than four seconds, so it's usually not feasible to exhaustively test all possible inputs. Their test program for this game, however, is a pretty good one because it includes test cases for all six of the nastier edge cases (IV, IX, XL, XC, CD, CM), plus eight random ones, making it improbable for an incorrect solution to be accepted.

Getting Started

In any golf game, I try to get a complete working solution as quickly as possible. This method of golfing boosts morale in that you always have a working solution -- and enjoy frequent positive feedback each time you shorten it.

I'd just finished playing the Fonality Golf Challenge, so I began this game simply by inverting Ton's winning HART magic formula from the Fonality game:

s!.!y$IVCXL426(-:$XLMCDIVX$dfor$$_.=5x$&*8%29628;${"$$_ "}=-$_!egfor-4e3..0;print${<>}
86 strokes. Hmmm. The leading score posted at this early stage was a remarkable 60 strokes by perl internals hacker robin. This hot start made it immediately clear that Ton's astonishing and magical decimal-to-roman formula was going nowhere in this new game, where you must convert the other way. That was a relief to me because I didn't relish a repeat of the ill feeling surrounding the Perl Golf Ethics thread.

Since I'd already downloaded the CPAN Roman module, I next took a look at the algorithm it employed, namely:

sub arabic { my($arg) = shift; isroman $arg or return undef; my($last_digit) = 1000; my($arabic); foreach (split(//, uc $arg)) { my($digit) = $roman2arabic{$_}; $arabic -= 2 * $last_digit if $last_digit < $digit; $arabic += ($last_digit = $digit); } $arabic; }
Though obviously not golfed, I was impressed by this function, especially this line:
$arabic -= 2 * $last_digit if $last_digit < $digit;
because this clever trick enables the algorithm to make one pass through the input string, one character at a time.

Nowadays we have a principle in Perl, and we stole the phrase Huffman coding for it, from the bit encoding system where you have different sizes for characters. Common characters are encoded in a fewer number of bits, and rarer characters are encoded in more bits. We stole that idea as a general principle for Perl, for things that are commonly used, or when you have to type them very often -- the common things need to be shorter or more succinct.

-- Larry Wall

My prior golfing experience told me that any algorithm that essentially just traverses a string one character at a time is usually very competitive. And this very common operation of traversing a string one character at a time should be golf-able in just about any language -- assuming the language designers Huffman-coded common operations, as Larry Wall did.

The primary tactical problems to be solved with this algorithm are how to shorten the code surrounding $last_digit and how to perform the roman2arabic lookup.

Regexes always win (Mtv's law of golf)

Golfing this promising algorithm proved straightforward once I remembered Mtv's law and used a regex with side-effects ($') for the lookup table:

I1V5X10L50C100D500M1e3=~$_,$\-=2*$}x($}<$')-($}=$')for<>=~/./g;print
Notice that this solution uses the well-known golfing trick, invented by Dutch golfing maestro Eugene van der Pijll, of using $\ as the accumulator, allowing you to print it with a bald print, rather than print$a say, thus saving two strokes.

Getting to 68 with very little effort made this first approach look very promising. Indeed, this simple algorithm formed the basis of the winning solutions in all four languages (Perl, Python, Ruby, PHP), albeit with many twisty tactical turns along the way.

An Interesting Diversion

How many different algorithms did you try? Me, I tried it four or five different ways until I stumbled upon one that seemed shorter

-- Expert golfer Jasper offers sound advice to novice golfer petdance in Re^9: Perl Golf Ethics

In golf, it's desirable to have multiple irons in the fire; that is, to have multiple working solutions using different approaches. Why? Because it's often possible to produce a solution shorter than all of them by combining the best ideas from the various approaches. A related lesson I took from this game is the value of writing solutions in all four languages (Perl, Python, Ruby, PHP); the act of doing that often uncovered algorithmic and tactical improvements that could be transferred from one language to another.

So, to broaden possible solutions, I played around with a tr (aka y) transform reminiscent of the Fonality game:

#!perl -lp s#\d#!($\+=$.*$&*(2cmp$'))#eg,$..=0while y/MDCLXVI/CLXVI51/
70 strokes. But I had to work very hard. I spent far too long fiddling with this approach because I found it fascinating. My golfing instincts were telling me, however, that this approach was far less promising than the previous simpler one, so I reluctantly gave it up at this early stage and never came back to it.

Since symbolic references were so effective in the Fonality game, I naturally tried a similar approach early in this game. Using xor, a common golfing technique, the required $I, $V, $X, $L, $C, $D, $M variables can be built fairly easily, as illustrated by the following test program, showv.pl:

#!perl -n ++$I;$$_=$.*=$^F^=7for V,X,L,C,D,M; print"I=$I V=$V X=$X L=$L C=$C D=$D M=$M\n";
Running echo hello|perl showv.pl displays:
I=1 V=5 X=10 L=50 C=100 D=500 M=1000
So far, so good. However, I could find no remotely competitive solution using this approach. For example, though this one does work:
#!perl -p ++$I;$$_=$.*=$^F^=7for VXLCDM=~/(.)/g; s!!($$1<${_&$'}?'-':'+').$$1!eg;$_=eval
it's 87 strokes in length. Ouch. So I also abandoned this interesting approach early in this game. This shows how fickle golf can be: a winning approach in the Fonality game proved to be a complete dud in this very similar one.

Update: Much later I found an improved 74 stroke symref-based solution:

$b=++$I;$$_=$b*=$^F^=7for V,X,L,C,D,M; $\+=$n-$\%$n*2while$n=${+getc};print

Functional approaches never win

Much as I enjoy functional programming, this style rarely wins golf tournaments. For fun, I tried a couple of functional solutions, such as:

use List::Util 'reduce'; print reduce{$a+$b*($b+1<=>$a)}map{/./;$|--&&(M999D499C99L49X9V4I=~$&+ +$')*y///c}reverse<>=~/((.)\2*)/g
Though there are surely shorter functional-style solutions (update: see, for example, this one found much later), I quickly became discouraged and gave up on this approach quite early.

Into the Top Ten

Of course. The true&tested method of "Can't possibly work, let's try it anyway."

-- Eugene van der Pijll during Infix to RPN game

This classic Eugene van der Pijll quote is the single best piece of advice I could give to any aspiring golfer. Remembering Eugene's aphorism helped me time and time again during this game, especially later when I started golfing in languages I hardly knew (Python, Ruby and PHP).

Reverting to my original 68 stroke solution:

I1V5X10L50C100D500M1e3=~$_,$\-=2*$}x($}<$')-($}=$')for<>=~/./g;print
that $} "previous value" variable was starting to get really annoying. It would be nice to get rid of it somehow. But how? I wonder if it's possible to exploit perl's evaluation order somehow. Well, follow Eugene's advice and just try it!
$\+=$'-2*$'*(-$'>I1V5X10L50C100D500M1e3=~$_-$')for<>=~/./g;print
Yes! No more $}. Down to 64 and on the leaderboard now! Woo hoo! At this point, I allowed myself a few punches in the air with my fist and even jumped up from my desk and ran madly around the room celebrating like a soccer player who just kicked a goal. As for why this solution works, I'll leave that as an exercise for the keen student of perl internals.

The top ten, after four weeks of play:

1st 60 robin Perl 2nd 61 arpad Perl 3rd 63 bearstearns Perl 4th 64 kounoike Perl 5th 64 eyepopslikeamosquito Perl 6th 73 flagitious Ruby 7th 73 yvl Ruby 8th 73 bitsweat Ruby 9th 73 jojo Perl 10th 75 shinh Perl

After the euphoria had died down, I realized I was now completely stuck.

What's the rush?

Why is everyone in such a rush?

-- Peter Norvig in Teach Yourself Programming in Ten Years

"Learning Perl in 24 Minutes Unleashed, in a Nutshell for Dummies" is the one I have

-- f00li5h cited at perl.net.au Humour page

My head is hurting, my right eye feels like it's going to pop like a mosquito drinking from an expresso addict with high blood pressure, I want to crawl somewhere damp and dark and quiet and I consider never to touch a keyboard again.

-- `/anick, after hacking all night on the last day of the TPR 02 Perl Golf tournament

Then it dawned on me: what's the rush? With this new form of golf, there's no need to risk golf-induced eye damage, as `/anick unfortunately suffered, in a desperate attempt to shave one more stroke before the game closes. You can simply, Zen-like, just keep chipping away at your solution at your own pace. So I just kept whittling away whenever I had some spare time.

Some months later, I stumbled upon an unusual <=> spaceship operator solution:

$/=\1;$\+=-$'*(-$'<=>I1V5X10L50C100D500M1e3=~$_-$')for<>;print
Notice that, unlike earlier algorithms, this one requires one more iteration than the string length. For this purpose, I utilized the trailing newline, simply changing <>=~/./g to <>=~/./sg initially, at the cost of a single stroke. Later, I got that stroke back by noticing that $/=\1; combined with <> is one stroke shorter.

62 strokes! Only three strokes from the lead now, with "bearstearns" the new leader on 59.

After further months drifted by, I whittled two more strokes:

$\+=$'-2*$'%(I1V5X10L50C100D500M1e3=~$_*$')for<>=~/./g;print
60 strokes! I felt a bit embarrassed at not finding this simple trick sooner. Not to worry, no rush. Only one from the lead and getting excited now.

For some time now I was aware that changing the table lookup string from:

I1V5X10L50C100D500M1e3
to:
M999D499C99L49X9V4I
saved three strokes -- if you could combine it with an expression you had to perform anyway to add one to $' (as was achieved in Re^2: Golf: Magic Formula for Roman Numerals). The trouble is, I could not find an economical way to add one to the shortened lookup string in this game. The best I found is the following 60 stroker:
$\+=$z-2*$z%($z=M999D499C99L49X9V4I=~$_+$')for<>=~/./g;print

As indicated at Golf: Magic Formula for Roman Numerals, I was also playing around with replacing the lookup table with a magic formula. At first, I struggled to fit the magic formulae I had found into a short solution to this course, but eventually wound up with a flood of 60 strokers, such as:

$\+=$z-2*$z%($z=1 .E.(3^77%ord)%7>>y/VLD//)for<>=~/./g;print $\+=$z-2*$z%($z=.5**y/VLD//.E.(3^77%ord)%7)for<>=~/./g;print $\+=$z-2*$z%($z=5**y/VLD//.E.(42^88*ord)%5)for<>=~/./g;print $\+=$z-2*$z%($z=5**y/VLD//.E.71248%ord()%5)for<>=~/./g;print $\+=$z-2*$z%($z=5**y/VLD//.E.(3&57532/ord))for<>=~/./g;print $\+=$z-2*$z%($z=.1.E.(3^85%ord)%7>>y/VLD//)for<>=~/./g;print
I now had many different 60 stroke solutions, but the lead was at 59!

Never give up

But it's not over then ... I pretty soon found both the basic 33-solutions by permuting ways to do that a bit ... In fact, there's an almost 32 solution ... which fails on a stderr message and does not work on arguments that start with a number and contain a letter. But it has to be tried at least. How many of the 33 people did?

-- Golfing legend Ton Hospel showing why he's No. 1; he's always searching for improvements

Alas, I couldn't improve any of these solutions and I've still got no idea what the leading golfers on 59 were doing.

In utter frustration, I took off for a long two hour walk to clear my head. After about an hour of walking, in desperation, I tried a completely different approach. I remember walking along calculating its golf score in my head, again and again, and kept coming up with 58! Surely, it couldn't be. Usually, when I have these "epiphanies", there's something I've overlooked and the solution fails dismally when I actually test it. With trembling fingers, I typed in the 58 stroke solution:

$\+=$z-2*$z%($z=VLD=~$_*5+IXCM=~$_."E@-")for<>=~/./g;print
1, 2, 3, ... 99, 100, ... 3999. Success, all tests pass! Time for another celebration.

For the first time in my life, I'm leading a Perl golf tournament! This solution, requiring no magic formulae at all, is not that hard to find and I remember feeling quite surprised at the time that noone else had found it after more than a year of play.

Magic formulae always win (update to Mtv's law)

When I started writing Python, Ruby and PHP solutions to this problem, I fell back to the magic formula approach, mainly due to my ignorance of general golfing tricks in these languages. Curiously, I was then able to apply the tricks learnt there to improve the Perl magic formula based solutions and so beat the 58 stroke regex-based solution described above.

In the next installment, I'll describe how I golfed this problem in Python and Ruby.

References

References Added Later

Updated 26-apr 2009: Minor improvements to wording. March 2016: Added Larry Wall quote re Huffman coding.