Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re: Advanced techniques with regex quantifiers

by AnomalousMonk (Archbishop)
on Jul 19, 2015 at 14:54 UTC ( [id://1135365]=note: print w/replies, xml ) Need Help??


in reply to Advanced techniques with regex quantifiers

... I've been experimenting again with using Perl regexes more like grammars ...

My kneejerk reaction (and, I suspect, that of many others) on reading something like this is, "Why not just use a grammar?" At least in Perl 5, regexes can be quite useful when applied to limited grammar parsing tasks, but will quickly run out of steam and fall over (or else become monstrously unwieldy) when pushed beyond a certain limit. I suspect the situation may be different with Perl 6, but I must confess ignorance here. In any event, my impression is that the Perl 6 regex compiler/engine is so radically different from that of Perl 5 that feature back-ports can only be piecemeal and incremental.

Further, It's easy to come up with limited approaches to deal with oddball problems, e.g., dynamically varying quantifier counts:

c:\@Work\Perl>perl -wMstrict -MData::Dump -le "my $s = '04abcdefgh06ABCDEFGH05tuvwxyz'; my @caps = map join('', m{ \A (\d\d) }xms, m{ ([[:alpha:]]{$1}) }xms), $s =~ m{ \G \d\d [[:alpha:]]+ }xmsg ; dd @caps; " ("04abcd", "06ABCDEF", "05tuvwx")
Something like this can take care of an immediate problem, but obviously isn't going to integrate well into an even moderately large parser application — and then we're back to the "get a real parser" stance.

Please don't take these remarks as an attack on your interest in the problem. I share and applaud that interest and will upvote the OP as soon as I finish posting this. I love regexes too, but we must learn to recognize the limitations even of those we love most.


Give a man a fish:  <%-(-(-(-<

Replies are listed 'Best First'.
Re^2: Advanced techniques with regex quantifiers [example use-case]
by smls (Friar) on Jul 19, 2015 at 23:04 UTC

    Good points.

    It's just that when balancing elegance & performance & dependencies, in some cases I end up favouring the "one big regex" approach - assuming I can work around its technical limitations.

    For example, in one current use-case I want to search a large input stream for occurrences of any of several "things", and extract relevant information from each match. This is not exactly a "parser" use-case (since I don't care about most of the input stream, and don't want to build a unified AST from it) but it shares some similarities.

    In my prototype design (which seems to be working out fine so far), I do it by defining an array entry for each "thing" like so:

    my @things = ( { parse => qr/ regex0 /sx, action => sub { ... } }, { parse => qr/ regex1 /sx, action => sub { ... } }, ... );

    And then when the program starts, it dynamically compiles all of those individual regexes into a big branching one of the form:

    qr{ (?| regex0 (*MARK:0) | regex1 (*MARK:1) | regex2 (*MARK:2) | ... ) }sx

    ...and starts matching this big regex against the input stream using a sliding window approach inspired by this old post by BrowserUk.

    And whenever a match is found, it calls the corresponding action function roughly like so:

    $things[$main::REGMARK]{action}->();

    This approach is attractive to me because:

    • Defining each thing's parsing and processing code separately but closely together, provides the best kind of encapsulation IMO.
    • It is efficient. Especially if each branch starts with a literal part (rather than a capture etc.) – apparently the regex compiler applies an optimization in that case which allows it to blaze through parts of the input stream with no matches.
    • It is cleanly extensible - I can easily add more entries to @things.
    • It has no non-core dependencies.

    Each branch/thing can have its own capture groups etc, and process them in its action function.
    However, if their parsing regexes need more complex quantified parts, then the advanced techniques I described above become necessary in order to be able to keep this design.

    If those techniques were built-in instead of requiring scary embedded code blocks, I think I would use this kind of approach more often.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1135365]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (3)
As of 2024-04-25 19:51 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found