Hey gang -

I've been working on a piece of code for a couple of days, and realized that my notes show a nice, clear progression from rough ideas to ordinary, procedural code, to object-oriented design. I figured that some of you might find it useful of look over my shoulder, as t'were, while I shuffle the ideas around.

Preliminaries

background:

The code in question happens to be a longest-match lexer. If you aren't fluent in compiler design, a compiler usually breaks down into four pieces:

a lexer
a parser
a syntax analyzer
and a code generator

The lexer reads the raw input and identifies basic pieces called tokens. You could say that the lexer translates the input from a stream of raw characters to a stream of tokens. The parser makes sure the tokens are arranged properly, and builds a structure called the parse tree. The syntax analyzer makes sure the parse tree is built properly, and shuffles values around so they're easier to turn into code. The result is called a syntax tree. Then the code generator walks the syntax tree and creates the output you wanted.. executable machine code, HTML documents based on XML input, or whatever.

A longest-match lexer finds the longest possible sequence of characters that match some token or other. For contrast, a shortest-match lexer returns the first token it sees, then starts over again. Shortest-match lexers are easier to write, but the parser has to pick up the work of assembling the pieces. Longest-match parsers are a bit more complicated, but they make the parser simpler.

In both cases, the core of a lexer is a state machine.

sermonette:

For those of you who've gotten this far and are mentally composing a reply along the lines of: "Why reinvent the wheel? Just use [insert CPAN module here...]," my answer is:

Because I chose to.

This is research code, not production code. My goal is to explore the problem space, not to deliver a certain feature set by a given date. I'm exercising my problem solving skills, so you may as well ask a jogger why he doesn't just take the bus.

The three cardinal sins of programming are sloppiness, arrogance, and haste. Declaring one solution better than another without bothering to learn the actual problem is, at very least, hasty.

The main event

first cut -- rough ideas

pass 1:

A state machine uses a transition table, which maps state-and-input-character pairs to new states. The machine begins in a state named 'start', reads the next input character, looks up that pair in its transition table, and moves to the new state. Then it gets the next input character and finds the appropriate transition for the current state. That cycle repeats until the machine runs out of input. If the machine finds a state-and-input-character pair for which there is no transition in the table, we call that an error.

If the machine enters a state that completes a token, we say that the machine accepts the sequence of characters it's seen so far. A longest-match lexer keeps track of the accept states until it hits an error, then backs up to the last accept state and creates a token. Then it backs up the input stream to the first unaccepted character, drops back to the start state, and begins again.

And for the sake of vocabulary, a data structure that records the current state of the machine, and the current position of the input stream, is called the machine's configuration.

pass 2:

Now let's try that in a more-or-less procedural way:

Given a state and an input character, call the table to determine the next state. If a transition exists, take it. If the new state accepts, mark this configuration for future reference, then advance the input and start over.

We can call marking a configuration making a memo, so the sequence for an ordinary transition is: move, memo, advance.

If there's no transition, pull up the last memo. build a token and push it onto an output queue. Return the input pointer to the memoized position, move the machine back to the start state, and start again. That means the sequence for an error transition is: token, rewind, restart.

That sequence works, but I've mentioned things like an output queue and an input pointer that I didn't talk about in the previous description. The rough pieces are there, but they still need some tuning.

pass 3:

At the beginning of the loop, the state register holds some value. If a transition exists for the next unprocessed character, advance the state register to the new state. If the new state accepts, make a memo of the current input position and accept type. Then advance the input pointer. This process does not change the output queue.

If no transition exists, call up the last memo. Create a token of the recorded type, and push that token into the output queue. Rewind the input iterator to the recorded position. Return the state register to the start state. This process does change the output queue, but does not advance the input pointer.

That's clearer, so let's give it some structure.

second cut -- procedural pseudocode

pass 4:

    state = start
    transition = transition-for (state, input.current-char)
    
    if (transition.is-good) {
        state = transition.state                # move
        
        if (state.accepts) {                    # memo
            memo.type = state.type
            memo.position = input.position
        }
        advance (input)                         # advance
    } else {
        push (output-queue, token (memo.type))  # token
        input.position = memo.position          # rewind
        state = start                           # restart
    }
[download]

Better still. It isn't anywhere close to working code, but the ideas are getting clearer.

pass 5:

One thing that doesn't work is the input. We don't have anything that will tell us what string was accepted. We need the range of characters from the start state to the last accept state.

In other words, we need to store the position of the last start state, and the position of the last accept state.

    state = start
    last-start = input.position
    transition = transition-for (state, input.current-char)
    
    if (transition) {
        state = transition.state                # move
        
        if (state.accepts) {                    # memo
            memo.type = state.type
            memo.position = input.position
        }
        advance (input)                         # advance
    } else {
        push (output-queue, token {             # token
            type = memo.type
            value = input.character-range (
                last-start
                memo.position
            )
        })
        input.position = memo.position + 1      # rewind
        state = start                           # restart
        last-start = input.position
    }
[download]

And I think that does everything we need for a single input character. There are still some things we can tune, and the outer loop isn't there yet, but the core ideas seem to be in the right places.

pass 6:

Now let's add the outer loop:

    state = start
    memo.last-start = input.position
    
    while (input.has-more) {
        transition = transition-for (state, input.current-char)
        
        if (transition.is-good) {
            state = transition.state                    # move
            
            if (state.accepts) {                        # memo
                memo.type = state.type
                memo.last-accept = input.position
            }
            advance (input)                             # advance
        } else {
            push (output-queue, token {                 # token
                type = memo.type
                value = input.character-range (
                    last-start
                    memo.position
                )
            })
            input.position = memo.position + 1          # rewind
            state = start                               # restart
            last-start = input.position
        }
    }
[download]

That's even more complete. It's still a far cry from working code, and the boundaries between values and functions have gotten fuzzy, but it isolates responsibilities pretty well.

Now, I think we can submerge the memo in the input object.. at least part of it. We can probably move last-start and last-accept into the input object, since they're both related to the input position.

pass 7:

Let's move the position memos to the input object. That should take some code out of this block and make things easier to read.

    state = start
    input->remember-start-position
    
    while (input.has-more) {
        transition = transition-for (state, input.current-char)
        
        if (transition.is-good) {
            state = transition.state                    # move
            
            if (state.accepts) {                        # memo
                memo.type = state.type
                input->remember-accept-position
            }
            input->advance                              # advance
        } else {
            push output-queue, token (                  # token
                memo.type
                input.accepted-string
            )
            input->rewind                               # rewind
            state = start                               # restart
        }
    }
[download]

As you might notice, the notation has started to migrate toward object-oriented form. The dots represent values, and the arrows represent functions with no return value. I could convey the same ideas with functional notation.. using advance(input) rather than input->advance for instance, but the OO notation works, too.

pass 8:

Since things are starting to become objects, let's make the state responsible for finding its own transitions:

    state = start
    input->remember-start-position
    
    while (input.has-more) {
        if (state.has-transition-for (input.current)) {
            state = state.transition-for (input.current-char)   # move
            
            if (state.accepts) {                        # memo
                last-accept = state
                input->remember-accept-position
            }
            input->advance                              # advance
        } else {
            push (output-queue, token {                 # token
                type = last-accept.type
                value = input.accepted-string
            })
            input->rewind                               # rewind
            state = start                               # restart
            input->remember-start-position
        }
    }
[download]

This works, but there are a couple of things that bug me. First, the process of finding the transition is redundant. Second, the notation is starting to get on my nerves. Some of the values look more like return values from object methods. So from now on, I'm going to assume that everything is a function, and replace the dots with arrows.

pass 9:

Let's get rid of that redundancy:

    state = start
    input->remember-starting-position
    
    while (input->has-more) {
        if (t = state->transition-for (input->current-char)) {
            state = t->new-state                        # move
            if (state->accepts) {                       # memo
                last-accept-state = state
                input->remember-accept-position
            }
            input->advance                              # advance
        } else {
            push (output-queue, token {                 # token
                type = last-accept-state->type
                input->last-accepted-string
            })
            input->rewind                               # rewind
            state = start                               # restart
            input->remember-starting-position
        }
    }
[download]

Again, it works, but setting temporary variables in a conditional bugs me almost as much as looking at the same information twice. The conditional itself is the problem, so let's start shuffling things around so we can get rid of it.

pass 10:

The first thing to do is add an error state. As programming techniques go, this is roughly equivalent to moving from roman numerals to positional (aka: Arabic) notation. The null object is conceptually equivalent to the number zero, since it indicates a lack of value, while still being a value.

    state = start
    input->remember-starting-position
    
    while (input->has-more) {
        state = state->transition-for (input->current-char)
            || error-state                              # move
        
        if (state->is-error) {
            push (output-queue, token {                 # token
                type = last-accept-state->type
                value = input->last-accepted-string
            })
            input->rewind                               # rewind
            state = start                               # restart
            input->remember-starting-position
        } else {
            if (state->accepts) {                       # memo
                last-accept-state = state
                input->remember-accept-position
            }
            input->advance                              # advance
        }
    }
[download]

That may not look like much of a change, but it is. The previous conditional had to decide whether state exists or not. Now we know that a state object will always exist, and we're looking for one with a specific value.

third cut -- OO pseudocode

pass 11:

Since we know that we'll always have a state object, we can get rid of the conditional by overriding an object method:

    regular-state::process (input) {
        input->remember-accept-position if (self->accepts)  # memo-2
        input->advance                                      # advance
    }
    
    error-state::process (input) {
        push (output-queue, token {                     # token
            type = last-accept.type
            value = input->last-accepted-string
        })
        input->rewind                                   # rewind
    }
    
    error-state::transition-for (char) {
        return (start)                                  # restart-1
    }

    start-state::process (input) {
        input->remember-starting-position               # restart-2
    }
    

    state = start-state
    
    while (input->has-more) {
        last-accept = state if (state->accepts)         # memo-1
        state->process (input)
        state = state->transition-for (input->current-char)
            || error-state                              # move
    }
[download]

Of course that doesn't work, because the value last-accept is now visible in both the main loop and in the error state's process() method. That's very ugly, and needs to be remedied ASAP.

We've also changed things so our 'move, memo, advance', and 'token, rewind, restart' sequences are out of phase with the loop itself. the 'move' and 'restart-1' steps fall at the end of one pass through the loop, and the 'memo, advance' and 'restart-2' steps happen at the beginning of the next pass. Everything still ends up happening in the correct order, though.

pass 12:

The value last-accept is the only part of the memo that wasn't folded into the input object. There's no technical reason we can't submerge last-accept too, but it does create a naming problem. There's no reason an object named input should know about a state.

Names are important, so let's change the input object to a configuration object. The configuration knows both input position and state, so we get the behavior we wanted under a name that makes sense:

    regular-state::process (config) {
        if (self->accepts) {                            # memo
            config->remember-accept-position
            config->remember-accept-type (self->type)
        }
        config->advance-input                           # advance
    }
    
    error-state::process (config) {
        push (output-queue, token {                     # token
            type = config->accepted-type
            value = config->accepted-string
        })
        config->rewind-input                            # rewind
    }
    
    error-state::transition-for (char) {
        return (start)                                  # restart-1
    }
    
    start-state::process (config) {
        config->remember-starting-position              # restart-2
    }    
    
    
    state = start
    
    while (config->has-more-input) {
        state->process (config)
        state = state->transition-for (config->current-input-char)
            || error-state                              # move
    }
[download]

Wrap-up

There's a lot more we could do. For one thing, I don't have any error handling, where 'error' means a stream of input that doesn't match any known token. There's the configuration object, which will probably use our old input object as an internal component. The state objects themselves are a subject unto themselves. I plan to use nondeterministic finite automata, which means each 'state' will actually be a set of states, and the transitions will be generated on the fly by subset construction.

That's all beyond the scope of this post, though. These twelve passes through the loop show how (and why) this batch of code has evolved from rough ideas, through procedural code, to object-oriented design. I hope it can serve as an example for people who are new to OOP, and are at the "okay, I understand encapsulation, inheritance, and all that, now how do I decide what objects to make?" stage.

Comment on evolving an OO design Select or Download Code

Replies are listed 'Best First'.
Re: evolving an OO design by rinceWind (Monsignor) on Nov 20, 2002 at 11:08 UTC
mstone++, you have given us all an excellent insight into your thought processes. I recall my own introduction to OO, which was through Perl. I also remember several years of ignoring OO, as I did not see the point of it. Then I tried applying objects and classes to a problem, piecemeal, and arriving at a solution I was pleased with at the time, but I would probably recoil in horror at if I were to see the code now. I do not wish to be too scathing on your approach, merely to shed some light on accepted wisdoms. Firstly, your approach is bottom up. Most successful Object Oriented Design starts with defining the problem, working out the entities and relationships, deciding on which classes to use, then designing the code. The second piece of accepted wisdom is to look at what is out there already. For those of you who've gotten this far and are mentally composing a reply along the lines of: "Why reinvent the wheel? Just use [insert CPAN module here...]," my answer is: Because I chose to. In your case, this is forgiveable if it is purely a learning exercise to teach yourself about OO. However, you can learn a tremendous amount from the modules that others have already written. Although you may reject many of the modules as being unsuitable for your particular problem, knowledge of their existence could prove useful on a future occasion. The namespace DFA:: (Discrete Finite-state Automaton) contains some modules which do very similar stuff, and Parse::RecDescent provides a tool to build complete grammars, and according to TheDamian, most of the capabilities of this module are going into the Perl 6 language. regarding OO, there are a large number of modules in the Class:: namespace. This is a case of TAMWTDI (there are many ways to do it). Finally, consider gaining an understanding of design patterns. I am not advocating following the Gang of Four book¹ religiously - this is Perl after all, and you are allowed niceties like variable argument lists and AUTOLOAD. But, an appreciation of design patterns will help you get the right level of abstraction and decoupling to solve problems without tears, and give you a library of ways of thinking about problems. There are also other more accessable books on design patterns, and I would recommend getting one of these first, before attempting the GoF. ¹Design Patterns : Elements of Reusable Object-Oriented Software Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides. (Addison Wesley, ISBN ISBN 0201633612)	[reply]
Re(2): evolving an OO design by mstone (Deacon) on Nov 22, 2002 at 21:43 UTC
My apologies for waiting so long before posting a reply.. I just did a couple of 22-hour days back to back (as a stagehand, not as a coder). Firstly, your approach is bottom up. Most successful Object Oriented Design starts with defining the problem, working out the entities and relationships, deciding on which classes to use, then designing the code. I tend to side with Knuth with regard to the top-down/bottom-up debate: both ideas are useful, but the best software design tends to happen from the middle out. Check out Literate Programming (D.E. Knuth; CSLI Publications; ISBN: 0937073814) for a wonderful treatment of the subject. Both the top-down and bottom-up methods give programmers ways to sweep bad decisions under the rug. Top-down design shares the same basic weakness as testing, in that you can only design (or test) for the problems you know about. Checking your top-down design won't tell you about missing assumptions, or assumptions that contradict each other in subtle ways. Bottom-up design, OTOH, tends to fall prey to circular logic. Every piece of code ends up needing to be the client for some chunk of information, but nobody wants to take responsibility for it. You need bottom-up information to make sure your top-down design is both complete and self-consistent, and you need top-down design to keep your bottom-up pieces from wrapping themselves up into a snakepit of mutual dependencies. In your case, this is forgiveable if it is purely a learning exercise to teach yourself about OO. Minor misunderstanding: I'm not doing this to learn about OOPerl.. I've been doing that for years. I'm doing it to learn about compilers and parsers, specifically in terms of applying my own OO preferences to the job of building a compiler. I have read Parse::RecDescent, and I think it's a wonderful piece of code. I'm just exploring my own ideas about how to partition responsibilities. Finally, consider gaining an understanding of design patterns. grin I'm already a Patterns junkie. My copy of GoF is well-worn, as are Fowler's Analysis Patterns and Pattern Languages of Program Design vols 1-4. I also have architect Christopher Alexander's 7-book series on urban design, of which A Pattern Language is volume 3. The 'error state' that I introduced in pass 10 is actually an application of the Null Object Pattern, for instance.	[reply]
Re: Re: evolving an OO design by Schemer (Scribe) on Dec 04, 2002 at 04:10 UTC
.. Parse::RecDescent provides a tool to build complete grammars, and according to TheDamian, most of the capabilities of this module are going into the Perl 6 language. Just for your information there are types of grammar a recursive descent parser (like Parse::RecDescent) can't parse. Among other conditions (which I don't remember quite clearly) a left-recursive grammar can't be parsed by recdescent. A LR(n) parser would be more useful to write a grammar for a full language. The strenght of Parse::RecDescent is in its ease of use to parse simple grammars, which is most of the time enough for general needs. And from what I think I've understood from what I've read (parse that!), Perl 6 will be even more powerful on parsing grammars than Parse::RecDescent, which may make it a really good tool for prototyping small languages.	[reply]
Re: evolving an OO design by Aristotle (Chancellor) on Nov 25, 2002 at 21:33 UTC
I took my time to take this in, because though fairly proficient at objective programming, I tend to use objects as structs on steroids. That is, I abstract data structure manipulation rather than control flow through them. Mostly, I guess, because I'm an autodidact and haven't looked into any formal OO design methodologies. So what I'd like to hear about is how the choice I would have made differently might affect the flexibility and overhead of the design. The point I depart is around pass 8 to 10, where you winnow out some redundancy by introducing an error state. At that point, I would have submerged the state register in the state object and made the `transition-for` method a mutator, so my personal roadmap might look something like this: `state->start input->remember-starting-position while (input->has-more) { if (state->attempt-transition-for (input->current-char)) { # m +ove if (state->accepts) { # m +emo last-accept-state = state input->remember-accept-position } input->advance # a +dvance } else { push (output-queue, token { # t +oken type = last-accept-state->type value = input->last-accepted-string }) input->rewind # r +ewind state = start # r +estart input->remember-starting-position } }` [download] Applying similar transformations as your following steps, we might arrive at something like the following. (Note: I juxtaposed the token/rewind steps so `type` and `string` could become simple accessors to the current configuration.) `state->start while (config->has_more_input) { if (state->attempt_transition_for(config->current_input)) { # +move config->remember(state->type) if state->accepts # +memo config->input->advance # +advance } else { config->rewind # rewind push (output_queue, token { # token type = config->type value = config->string }) state->start # restart } }` [download] In this case, the states would be represented not as a state subclass each, but rather as a hash (dispatch?) table inside the state class. Some further submerging reduces this down to `config->start while (config->has_more_input) { if (config->state->attempt_transition) { # move + advanc +e config->remember if config->state->accepts # memo } else { config->rewind # rewind + restart push output_queue, config->token # token } }` [download] What specific drawbacks, if any, do you see in this approach? As far as I can tell, both solutions are equivalent on some level, although the type of abstraction differs. Is that so? Makeshifts last the longest.	[reply] [d/l] [select]
Re: evolving an OO design by DapperDan (Pilgrim) on Nov 20, 2002 at 08:17 UTC
I just want to say that this is a gorgeous write-up; best meditation I've seen in a while. Admittedly much of it is over my head right now, but I will be revisiting it repeatedly until I do understand it all. ++	[reply]
Re: evolving an OO design by RedBull (Initiate) on Nov 20, 2002 at 20:33 UTC
well i cant see any problems but i will see to the job of breaking it down and finding where the error is. and i will get back to you a.s.a.p red bull	[reply]


Keep It Simple, Stupid
	PerlMonks

evolving an OO design