Lexical Analyzer with Actions : Continue Development?

As I kept refining and refactoring my code, I discovered that I really needed a lexical analyzer that could perform actions. Not quite Parse::RecDescent, nor as simple as HOP::Lexer. I'm sure that were I in an environment where I could install CPAN modules at will, I would use those; however, as it is I wound up writing my own wee module to suit my other coding tasks.

I would welcome any comments from my esteemed bretheren and sisters, including whether I should continue work on this, if I should CPAN it, or whether I'm simply reinventing the wheel for the n^th time.

Behold the actual code, with a packaged example:

package SimpleLexer;

use strict;
use warnings;

our $VERSION = 1.0;

# Generate a new lexical analyzer from the factory.
sub get_engine {
    my ( $this, $lexer, $init_state ) = @_;
    my $class = ref $this || $this;
    my $self = {
        STATE => ['DEFAULT'],
        LEXER => $lexer,
    };
    $self->{STATE} = [$init_state] if $init_state;
    return bless $self, $class;
}

# Lex the input.
sub lex {
    my ( $self, $text ) = @_;
    my $lexer = $self->{LEXER};
    FOUND_LEX_AGAIN: {
        for my $lex ( @{ $lexer->{ $self->{STATE}[-1] } } ) {
            my ( $regex, $action ) = @$lex;
            if ( $text =~ /\G$regex/gc ) {
                $action->($1||$text, $self);
                redo FOUND_LEX_AGAIN;
            }
        }
    }
}

sub begin_state {
    my ( $self, $state ) = @_;
    push @{$self->{STATE}}, $state;
}

sub end_state {
    my $self = shift;
    pop @{$self->{STATE}};
}

### Standalone ###

sub bold_begin {
    my ( $arg, $lexer ) = @_;
    print "BOLD ";
    $lexer->begin_state('bold');
}

sub bold_end {
    my ( $arg, $lexer ) = @_;
    print " NO BOLD";
    $lexer->end_state;
}

sub main {

    # Our lexer
    my $lexer = {
        DEFAULT => [
            [ qr/<b>/, \&bold_begin ],
            [ qr/<uc>/, sub { $_[1]->begin_state('uppercase') } ],
            [ qr/(.)/s, sub { print $1 } ], # echo
            #[ qr/./s, sub { } ], # no echo
        ],
        bold => [
            [ qr{</b>}, \&bold_end ],
            [ qr/<uc>/, sub { $_[1]->begin_state('uppercase') } ],
            [ qr/(.)/s, sub { print $1 } ], # echo
        ],
        uppercase => [
            [ qr{</uc>}, sub { $_[1]->end_state } ],
            [ qr/(.)/s, sub { print "\U$1" } ], # echo
        ],
    };

    # Usage
    if ( scalar @ARGV < 0 ) {
        print "Usage: $0\n";
        exit(1);
    }

    my $engine = get_engine( __PACKAGE__, $lexer );
    $engine->lex("This is a nifty <b><uc>uppercase</uc> test</b> to se
+e what <uc>this</uc> thing can do.\n");
}

main(@ARGV) unless caller;

my $package = __PACKAGE__;
[download]

Thank you!
-v.

"Perl. There is no substitute."

Comment on Lexical Analyzer with Actions : Continue Development? Download Code

Replies are listed 'Best First'.
Re: Lexical Analyzer with Actions : Continue Development? by ikegami (Patriarch) on Sep 16, 2006 at 07:34 UTC
HOP::Lexer didn't do the trick because what you need (and wrote) is a parser, not just a lexer. Your code has bugs. It doesn't recognie the bold in `<uc><b>...</b></uc>`, and it silently ignores the `</uc>` in `<b>...</uc></b>`. While this is not too hard to fix, you'll notice adding a third tag to your model will be very hard (and introduce lots of redundancy and fragility), and introducing a fourth will be near-impossible. ( 2 tags => 4 hashes in %$lexer 3 tags => 16 hashes in %$lexer 4 tags => 65 hashes in %$lexer 5 tags => 326 hashes in %$lexer ) Part of the problem is that it's not proper (i.e. maintainable, expandable) for each tag to be a seperate token. `<b>` and `<uc>` should consist of the same one (opentag) or three (opentagopen, tagname, tagclose) tokens. I love using `/.../xgc` for lexing (aka tokenizing). See What good is `\G` in a regular expression? for more on this topic. My version below uses this. ( Oops! So does yours! ) My version: use strict; use warnings; my %VALID_ELEMENTS = map { $_ => 1 } qw( b uc ); my %state; my @to_close; sub output_text { my ($text) = @_; $text =~ s/([[:print:]])/$1$1/g if $state{b}; $text = uc($text) if $state{uc}; print($text); } sub open_tag { /\G ( \w+ ) /xgc or die("Expecting tag name\n"); my $ele = lc($1); $VALID_ELEMENTS{$ele} or die("Unknown element name $ele\n"); /\G > /xgc or die("Expecting closing bracket\n"); !$state{$ele} or die("Attempting to open previously opened element\n"); $state{$ele} = 1; push @to_close, $ele; } sub close_tag { /\G (\w+) /xgc or die("Expecting tag name\n"); my $ele = lc($1); $VALID_ELEMENTS{$ele} or die("Unknown element name $ele\n"); /\G > /xgc or die("Expecting closing bracket\n"); $state{$ele} or die("Attempting to close unopened element\n"); my $to_close = $to_close[$#to_close]; $to_close eq $ele or die("Missing closing tag for element $to_close\n"); $state{$ele} = 0; pop @to_close; } sub process { # The following "for" aliases $_ to $text and # provides a target for the upcoming "redo" stmts. for ($_[0]) { /\G <\/ /xsgc && do { close_tag(); redo }; /\G < /xsgc && do { open_tag(); redo }; /\G ( [^<]+ ) /xsgc && do { output_text("$1"); redo }; # Falls through at the end of the string. } } process("This is a nifty <b><uc>uppercase</uc> test</b> to see what <u +c>this</uc> thing can do.\n"); [download]	[reply] [d/l] [select]
Re^2: Lexical Analyzer with Actions : Continue Development? by Velaki (Chaplain) on Sep 18, 2006 at 10:54 UTC
Yes, I noticed that, too -- the exponential code lines to cover the complex tag combinations; hence, a parser is needed at that point. The thing is that the tags, in the order I see them, are much simpler, and the "grammar" I'm using is such that I was hoping to get away with a simple lexer. As for why I'm not using the other tools out there, as you read in my original post, I'm not in an environment where I can install CPAN modules at will. However, thanks for the ideas. From what everyone has said, it appears that not only am I reinventing the wheel, but my wheel is pretty awful. I think my module will find a happy home in the coding graveyard, but might prove useful enough for the project for which it was created. Thanks again, -v "Perl. There is no substitute."	[reply]
Re: Lexical Analyzer with Actions : Continue Development? by tilly (Archbishop) on Sep 16, 2006 at 03:27 UTC
Can you come up with a simple statement of what need you are trying to solve and why the other tools don't solve it? I should also note that HOP::Lexer, like much that comes from the Higher Order Perl book, is deceptively simple. It is very simple, and looks very straightforward, but it is part of an incredibly flexible framework. You may find that with that you can accomplish whatever you wanted, even though at first glance it looks too simplistic. Oh, and if you're looking for some more ideas, you might look at Functional take 2. (And the parent node as well, of course.)	[reply]
Re: Lexical Analyzer with Actions : Continue Development? by ioannis (Abbot) on Sep 16, 2006 at 04:56 UTC
The Internet is full of lexical analyzers, as this was one of the favorable hobbies before the arrival of html and windows widgets: you are re-inventing the wheel. How about lex(1), flex(1), or even a full parser like RecDescent (if you don't mind the slow speed.) Lots of analyzers out there. Many parsers can also act as lexers, and some (rich) lexers like flex(1) can almost act like parsers. There is also a perl binding to flex(1) through my cpan module Parse::Flex.	[reply]


Do you know where your variables are?
	PerlMonks