http://www.perlmonks.org?node_id=690816


in reply to Parsing with Regexes and Beyond

Good!

Small nit: qr{#.*\n?$}m can be written simply as qr{#.*\n?}

Small problem: 3 + 4 6 5 7 is considered valid by your parser. You need to check to make sure all tokens were absorbed or return an EOF token and check if that's the next token. The latter method would remove the need for "no warnings 'uninitialized';".

A big difference between your parser and PRD is that yours doesn't allow backtracking. With PRD, you can do "ruleA: ruleB | ruleC" without worrying if ruleB and ruleC can start with the same token. With yours, you'd need to factor out the common leading token ("ruleA: token ( ruleB | ruleC )"). But that's probably not a bad thing, since you should do that in PRD anyway for performance reasons. On the plus side, I think that gives PRD an "infinite" lookahead.

I think you could improve performance by generating a single regexp from @token_def (/\G(?>$ws)(?>$tok1(?{...}|$tok2(?{...}|...)/g).

Also, it would be better if the lexer was an iterator instead of doing all the lexing up front. That would decrease the memory footprint.

Replies are listed 'Best First'.
Re^2: Parsing with Regexes and Beyond
by moritz (Cardinal) on Jun 07, 2008 at 09:31 UTC
    Small problem: 3 + 4 6 5 7 is considered valid by your parser. You need to check to make sure all tokens were absorbed or return an EOF token and check if that's the next token. The latter method would remove the need for "no warnings 'uninitialized';".

    Very good catch, thank you. I'll update the tutorial accordingly.

    I think you could improve performance by generating a single regexp from @token_def (/\G(?>$ws)(?>$tok1(?{...}|$tok2(?{...}|...)/g).

    Since (?{...}) is marked as experimental even in perl 5.10.0 and I (re)discovered some bugs in it, I won't use it, at least not in a tutorial at this level. Thanks for the suggestion anyway, it made me think about 5.10's named captures that could be used instead.

    Also, it would be better if the lexer was an iterator instead of doing all the lexing up front. That would decrease the memory footprint.

    Indeed. I decided against it because it slightly increases complexity (or at least makes it harder to understand), but I should at least mention it.

    I guess I'll find some time tomorrow to incorporate your suggestions.

      I won't use it, at least not in a tutorial at this level.

      I decided against it because it slightly increases complexity

      And that's why I didn't code up a solution for the last two tips. They were more along the lines of "Where do we go from here?". I meant to say as much, but I was being rushed away.