|Perl: the Markov chain saw|
evolving an OO designby mstone (Deacon)
|on Nov 20, 2002 at 02:59 UTC||Need Help??|
Hey gang -
I've been working on a piece of code for a couple of days, and realized that my notes show a nice, clear progression from rough ideas to ordinary, procedural code, to object-oriented design. I figured that some of you might find it useful of look over my shoulder, as t'were, while I shuffle the ideas around.
The code in question happens to be a longest-match lexer. If you aren't fluent in compiler design, a compiler usually breaks down into four pieces:
The lexer reads the raw input and identifies basic pieces called tokens. You could say that the lexer translates the input from a stream of raw characters to a stream of tokens. The parser makes sure the tokens are arranged properly, and builds a structure called the parse tree. The syntax analyzer makes sure the parse tree is built properly, and shuffles values around so they're easier to turn into code. The result is called a syntax tree. Then the code generator walks the syntax tree and creates the output you wanted.. executable machine code, HTML documents based on XML input, or whatever.
A longest-match lexer finds the longest possible sequence of characters that match some token or other. For contrast, a shortest-match lexer returns the first token it sees, then starts over again. Shortest-match lexers are easier to write, but the parser has to pick up the work of assembling the pieces. Longest-match parsers are a bit more complicated, but they make the parser simpler.
In both cases, the core of a lexer is a state machine.
For those of you who've gotten this far and are mentally composing a reply along the lines of: "Why reinvent the wheel? Just use [insert CPAN module here...]," my answer is:
Because I chose to.
This is research code, not production code. My goal is to explore the problem space, not to deliver a certain feature set by a given date. I'm exercising my problem solving skills, so you may as well ask a jogger why he doesn't just take the bus.
The three cardinal sins of programming are sloppiness, arrogance, and haste. Declaring one solution better than another without bothering to learn the actual problem is, at very least, hasty.
The main event
first cut -- rough ideas
A state machine uses a transition table, which maps state-and-input-character pairs to new states. The machine begins in a state named 'start', reads the next input character, looks up that pair in its transition table, and moves to the new state. Then it gets the next input character and finds the appropriate transition for the current state. That cycle repeats until the machine runs out of input. If the machine finds a state-and-input-character pair for which there is no transition in the table, we call that an error.
If the machine enters a state that completes a token, we say that the machine accepts the sequence of characters it's seen so far. A longest-match lexer keeps track of the accept states until it hits an error, then backs up to the last accept state and creates a token. Then it backs up the input stream to the first unaccepted character, drops back to the start state, and begins again.
And for the sake of vocabulary, a data structure that records the current state of the machine, and the current position of the input stream, is called the machine's configuration.
Now let's try that in a more-or-less procedural way:
Given a state and an input character, call the table to determine the next state. If a transition exists, take it. If the new state accepts, mark this configuration for future reference, then advance the input and start over.
We can call marking a configuration making a memo, so the sequence for an ordinary transition is: move, memo, advance.
If there's no transition, pull up the last memo. build a token and push it onto an output queue. Return the input pointer to the memoized position, move the machine back to the start state, and start again. That means the sequence for an error transition is: token, rewind, restart.
That sequence works, but I've mentioned things like an output queue and an input pointer that I didn't talk about in the previous description. The rough pieces are there, but they still need some tuning.
At the beginning of the loop, the state register holds some value. If a transition exists for the next unprocessed character, advance the state register to the new state. If the new state accepts, make a memo of the current input position and accept type. Then advance the input pointer. This process does not change the output queue.
If no transition exists, call up the last memo. Create a token of the recorded type, and push that token into the output queue. Rewind the input iterator to the recorded position. Return the state register to the start state. This process does change the output queue, but does not advance the input pointer.
That's clearer, so let's give it some structure.
second cut -- procedural pseudocode
Better still. It isn't anywhere close to working code, but the ideas are getting clearer.
One thing that doesn't work is the input. We don't have anything that will tell us what string was accepted. We need the range of characters from the start state to the last accept state.
In other words, we need to store the position of the last start state, and the position of the last accept state.
And I think that does everything we need for a single input character. There are still some things we can tune, and the outer loop isn't there yet, but the core ideas seem to be in the right places.
Now let's add the outer loop:
That's even more complete. It's still a far cry from working code, and the boundaries between values and functions have gotten fuzzy, but it isolates responsibilities pretty well.
Now, I think we can submerge the memo in the input object.. at least part of it. We can probably move last-start and last-accept into the input object, since they're both related to the input position.
Let's move the position memos to the input object. That should take some code out of this block and make things easier to read.
As you might notice, the notation has started to migrate toward object-oriented form. The dots represent values, and the arrows represent functions with no return value. I could convey the same ideas with functional notation.. using advance(input) rather than input->advance for instance, but the OO notation works, too.
Since things are starting to become objects, let's make the state responsible for finding its own transitions:
This works, but there are a couple of things that bug me. First, the process of finding the transition is redundant. Second, the notation is starting to get on my nerves. Some of the values look more like return values from object methods. So from now on, I'm going to assume that everything is a function, and replace the dots with arrows.
Let's get rid of that redundancy:
Again, it works, but setting temporary variables in a conditional bugs me almost as much as looking at the same information twice. The conditional itself is the problem, so let's start shuffling things around so we can get rid of it.
The first thing to do is add an error state. As programming techniques go, this is roughly equivalent to moving from roman numerals to positional (aka: Arabic) notation. The null object is conceptually equivalent to the number zero, since it indicates a lack of value, while still being a value.
That may not look like much of a change, but it is. The previous conditional had to decide whether state exists or not. Now we know that a state object will always exist, and we're looking for one with a specific value.
third cut -- OO pseudocode
Since we know that we'll always have a state object, we can get rid of the conditional by overriding an object method:
Of course that doesn't work, because the value last-accept is now visible in both the main loop and in the error state's process() method. That's very ugly, and needs to be remedied ASAP.
We've also changed things so our 'move, memo, advance', and 'token, rewind, restart' sequences are out of phase with the loop itself. the 'move' and 'restart-1' steps fall at the end of one pass through the loop, and the 'memo, advance' and 'restart-2' steps happen at the beginning of the next pass. Everything still ends up happening in the correct order, though.
The value last-accept is the only part of the memo that wasn't folded into the input object. There's no technical reason we can't submerge last-accept too, but it does create a naming problem. There's no reason an object named input should know about a state.
Names are important, so let's change the input object to a configuration object. The configuration knows both input position and state, so we get the behavior we wanted under a name that makes sense:
There's a lot more we could do. For one thing, I don't have any error handling, where 'error' means a stream of input that doesn't match any known token. There's the configuration object, which will probably use our old input object as an internal component. The state objects themselves are a subject unto themselves. I plan to use nondeterministic finite automata, which means each 'state' will actually be a set of states, and the transitions will be generated on the fly by subset construction.
That's all beyond the scope of this post, though. These twelve passes through the loop show how (and why) this batch of code has evolved from rough ideas, through procedural code, to object-oriented design. I hope it can serve as an example for people who are new to OOP, and are at the "okay, I understand encapsulation, inheritance, and all that, now how do I decide what objects to make?" stage.