
when $$s =~ m/\G.../gc is too verbose

by stefp (Vicar)
on Feb 02, 2006 at 22:15 UTC ( #527473=perlmeditation )

For manually writing lexers, my favorite idiom is $$s =~ m/\G.../gc. In scalar context it lets me advance through the string $$s that I want to lex. If the pattern matches at the current position, the position moves past the match; if not, the position is unchanged. \G anchors the match at the current position. I could also use $$s =~ s/^...//. It does not cost much because the implementation does not move the string to truncate it but just moves an internal pointer. But this is immaterial to the following discussion.
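To make the position bookkeeping concrete, here is a minimal sketch; the sample string and the @log array are assumptions made up for illustration, not part of any real lexer. It records pos() after a successful match, after a failed one, and after the next successful one.

```perl
use strict;
use warnings;

my $text = "abc123";
my $s    = \$text;      # reference to the string to lex
my @log;                # positions observed after each attempt

$$s =~ m/\G[a-z]+/gc;   # matches "abc" at the start
push @log, pos($$s);

$$s =~ m/\G[a-z]+/gc;   # fails at "1"; /c keeps the position
push @log, pos($$s);

$$s =~ m/\G\d+/gc;      # matches "123" from the saved position
push @log, pos($$s);

print "@log\n";         # prints "3 3 6"
```

Without the /c modifier, the failed second match would reset the position and the third match would start again from the beginning of the string.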

A lexer for Parse::Yapp ends up looking like

sub lexer {
    my ($parser) = shift;
    my $s = $parser->YYData->{INPUT};   # reference to the string to lex
    $$s =~ m/\G\s+/gc;                  # skip any spaces
    return ('INT', $1) if $$s =~ m/\G(\d+)/gc;
    return ('ID', $1)  if $$s =~ m/\G([A-Z]\w*)/gc;
    ...   # and it goes on for many tentative matches
}
I know that I always match on $$s, so why should I restate it at each match? I _had_ to remove these useless $$s!

It took me a long time to realize that I could do it with a typeglob trick:

*_ = $parser->YYData->{INPUT}; # reference to the string to lex
Now $_ is an alias to the string to lex, so I can match on it directly and don't need the =~ operator anymore.
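A minimal sketch of what the lexer might look like with the alias in place. The sub name next_token, the sample input, and passing the reference as a plain argument (instead of fetching it from $parser->YYData->{INPUT}) are assumptions for illustration. Note that the glob assignment is not localized, so the caller's $_ is replaced.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $parser->YYData->{INPUT}: a reference to the text.
my $text  = "42 Foo";
my $input = \$text;

sub next_token {
    my ($ref) = @_;
    # Assigning a scalar reference to the glob makes $_ an alias for $$ref.
    # This is not localized, so the caller's $_ is replaced as well.
    *_ = $ref;
    m/\G\s+/gc;                                 # skip any spaces
    return ('INT', $1) if m/\G(\d+)/gc;
    return ('ID',  $1) if m/\G([A-Z]\w*)/gc;
    return;                                     # no token matched
}

my @tokens;
while (my @tok = next_token($input)) {
    push @tokens, "$tok[0]=$tok[1]";
}
print "@tokens\n";
```

Because $_ is an alias rather than a copy, the match position lives on the original string and carries over from one call to the next.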

-- stefp

Replies are listed 'Best First'.
Re: when $$s =~ m/\G.../gc is too verbose
by chromatic (Archbishop) on Feb 02, 2006 at 22:36 UTC

    Clever. I usually put a local in there though, just to avoid trouble.

      Oops, I forgot it. I always localize or lexicalize variables private to a subroutine. That's why I never noticed that $_ is not implicitly localized on entry to a subroutine, contrary to what I thought.

      Sadly, localizing *_ or $_ doesn't play well with reference shuffling of strings with positions.

      sub lexer {
          (*_) = @_;
          print $1 if m/\G(A)/gc || m/\G(B)/gc;
      }
      my $a = "AB";
      lexer \$a;
      lexer \$a;
      This prints "A" then "B". If I add a local *_ or a local $_ at the entry of the lexer routine, it does not work anymore. So much for a cool trick.

      -- stefp

Re: when $$s =~ m/\G.../gc is too verbose (for)
by tye (Sage) on Feb 03, 2006 at 05:00 UTC
      This is the trick used by Calc.yp in the Parse::Yapp distribution. It does create an alias, but it conveys the wrong message because the block is not really used as a loop.

      -- stefp

        That's why I wish Perl allowed another keyword as yet another synonym for for/foreach. I'd propose "with", for example:
        with($$s) { ... }
        But in the meantime, I've trained myself to actually read/see
        for(SCALAR) { ... }
        as
        with(SCALAR) { ... }

        Chalk it up as another Perl idiom.

        Much like in English, you can use Perl's for() for iterating over a list, iterating via initialization + check + step, or associating a single topic with a block of syntax. So I, without apology, use for() for topicalizing. For you, I won't stop doing this. (: Excuse me for not demonstrating the use of English "for" analogous to init + check + step.
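        A minimal sketch of for() used as a topicalizer in a lexer. The sub name next_token and the sample input are assumptions for illustration; a real Parse::Yapp lexer would take the reference from $parser->YYData->{INPUT}.

```perl
use strict;
use warnings;

my $text = "7 Bar";
my $s    = \$text;    # hypothetical stand-in for $parser->YYData->{INPUT}

sub next_token {
    my ($ref) = @_;
    # A one-element for() aliases $_ to $$ref for the block only;
    # $_ is restored automatically when the block exits.
    for ($$ref) {
        m/\G\s+/gc;                              # skip any spaces
        return ('INT', $1) if m/\G(\d+)/gc;
        return ('ID',  $1) if m/\G([A-Z]\w*)/gc;
    }
    return;                                      # no token matched
}

my @tokens;
while (my @tok = next_token($s)) {
    push @tokens, "$tok[0]=$tok[1]";
}
print "@tokens\n";
```

        Unlike the typeglob assignment, the alias here is scoped to the block, so nothing needs to be localized by hand.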

        - tye        

Re: when $$s =~ m/\G.../gc is too verbose
by Anonymous Monk on Feb 02, 2006 at 23:28 UTC
    Maybe I don't understand the problem, but if you really hate typing so much, why not generalize the solution instead?

    Not that I like to generalize things, because I usually end up un-generalizing them a few months later (stupid shifting requirements!), but it seems easier than playing with symbol table manipulations just to save a few keystrokes to me... am I missing something?

    I'm thinking of something roughly along these lines... completely untested and possibly wrong code is below. ;-)

    # make a table of regular expression patterns
    my %table = (
        qr/(\d+)/       => 'INT',
        qr/([A-Z]\w*)/  => 'ID',
        # .... more tokens here
    );

    my ($parser) = shift;
    my $s = $parser->YYData->{INPUT};
    my @matches;    # any matches found by our re go in here

    foreach my $re ( keys %table ) {
        # for each regexp, check to see if it matches, and
        # put all the captured values in @matches if it does
        @matches = ( $$s =~ m/\G$re/gc );

        # return the appropriate token, and captures...
        return( $table{$re}, @matches ) if (@matches);
    } # end search for a token match

    # token not found... put error handling here
    ...
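    One thing to keep in mind with a hash-based table: Perl hashes have no defined key order, so when two token patterns can match at the same position, which one wins is unpredictable. A minimal sketch of the same idea with an ordered list follows; the names @table and next_token and the sample input are assumptions for illustration.

```perl
use strict;
use warnings;

# Ordered token table: earlier entries win when several patterns could match.
my @table = (
    [ INT => qr/(\d+)/      ],
    [ ID  => qr/([A-Z]\w*)/ ],
);

sub next_token {
    my ($s) = @_;                 # reference to the string being lexed
    $$s =~ m/\G\s+/gc;            # skip any spaces
    for my $entry (@table) {
        my ($name, $re) = @$entry;
        # Scalar context: one anchored match; pos() advances on success only.
        return ($name, $1) if $$s =~ m/\G$re/gc;
    }
    return;                       # no token matched
}

my $text = "X1 23";
my @seen;
while (my ($name, $value) = next_token(\$text)) {
    push @seen, "$name:$value";
}
print "@seen\n";
```

    The match is done in scalar context here, so each call consumes exactly one token and returns its first capture.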


      Without going to the extremities of toke.c (the Perl tokenizer), things are usually more complicated than mere pattern matching. One may have to test various flags. Otherwise, indeed, one could factor the matching out one way or another.

      -- stefp

Re: when $$s =~ m/\G.../gc is too verbose
by ambrus (Abbot) on Feb 03, 2006 at 15:48 UTC
Re: when $$s =~ m/\G.../gc is too verbose
by radiantmatrix (Parson) on Feb 07, 2006 at 17:53 UTC

    Why not just store $$s in a local copy of $_?

    # Either
    local $_ = $$s;
    # Or
    s//$$s/;    # tricky.. ;-)

    Actually, in this case, I'd be tempted to alter your approach altogether and use a regex table.

    sub lexer {
        my ($parser) = shift;
        my $s = $parser->YYData->{INPUT};
        # I don't get your line: 'm/\G\s+/gc; skip any spaces'
        my %dispatch = (
            INT => qr/\G(\d+)/,
            ID  => qr/\G([A-Z]\w*)/,
            #.. and so on ..
        );
        while (my ($key, $regex) = each %dispatch) {
            # qr// does not accept /g or /c, so apply them at match time
            return ($key, $1) if $$s =~ m/$regex/gc;
        }
    }
    A collection of thoughts and links from the minds of geeks
    The Code that can be seen is not the true Code
    I haven't found a problem yet that can't be solved by a well-placed trebuchet

      Your second solution is no solution:

      $_ = '!'; $s = \'No'; s//$$s/; print;

      You'd need to empty out $_ first, so the local $_ is the way.

        Absolutely. That's the tricky part. ;-) It will work when $_ is undefined, but not otherwise. Of course, you could always change it to s/.*/$$s/, but still not advisable. More of an obfu trick...

      About the copy: before even thinking about positions in strings, using a string copy is a no-no. Copying the string to be parsed for each token is madness.

      About a table for lexing: this is irrelevant to the discussion. Also, lexing can be more complex than matching. Yes, one can embed regular code in a regex, but that is a sign that table-based lexing is not appropriate.

      As I said to tye, using for is the right way to alias to $_. I don't like it because in the programming space, for is a loop... for me. :)

      In the natural language space, well, English is not my first language.

      So to paraphrase Churchill, for is the worst solution, but it is the only one.

      Hopefully, as TimToady said, Perl 6 will be cleaner.

      -- stefp

Node Type: perlmeditation [id://527473]
Approved by Corion
Front-paged by grinder