Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked

Perl's regex engine causes me distress

by japhy (Canon)
on Sep 21, 2000 at 23:04 UTC ( #33542=perlquestion: print w/replies, xml ) Need Help??

japhy has asked for the wisdom of the Perl Monks concerning the following question:

Here is my string: "A1234567890A". Here is my regex:
m{ A # A (?: [^AB]* # 0 or more non-A and non-B characters . B # any character, then a B )* # this combo, 0 or more times [^AB]* # 0 or more non-A and non-B characters A # an A }x
Now, before you ask, this is related to a) unrolling the loop, and b) reversing a regex. But it's also a very isolated case of the regex engine being a ninny.

This what I personally think the regex engine should do:
BEFORE & AFTER REGEX <> <A01234567890A> A <A> <01234567890A> [^AB]* <A01234567890> <A> . <A01234567890A> <> B FAILED <A0123456789> <0A> . <A01234567890> <A> B FAILED
At this point, Perl should NOT try to do:
<A012345678> <90A> . <A0123456789> <0A> B FAILED
since Perl should KNOW that '0' was matched by [^AB]*, so it can't POSSIBLY match B. Instead, Perl should realize it should give up, and continue:
<A01234567890> <A> [^AB]* <A01234567890> <A> A <A01234567890A> <> FINISHED
This is NOT the case. Perl zips ALL the way back to the first 0 in the string, trying to match .B until it is exhausted, and goes back to the '...890' having been matched by [^AB]*, and it goes to the [^AB]* outside the (?:...)*. This matches nothing, and then the 'A' matches.

My gripe is that Perl should know that if something COULD be matched by [^AB]*, then it CAN'T match 'B'.


Replies are listed 'Best First'.
Re: Perl's regex engine causes me distress
by japhy (Canon) on Sep 21, 2000 at 23:21 UTC
    As for how I got around Perl's unhappiness here, I rolled my own solution:
    m{ A (?: (?>[^AB]*) # match non-A's and non-B's WITHOUT BACKTRACKING (?: [AB] # match A or B next... | # ...or... (?<=.) # ...if there was a character before ) B # match B )* # 0 or more times [^AB]* # you know the rest A }x
    Ick. Why should I have to do that work?

Re (tilly) 1: Perl's regex engine causes me distress
by tilly (Archbishop) on Sep 21, 2000 at 23:14 UTC
    I know of no RE engine that does it, but the idea I outlined in this conversation would enable the above optimization, obviate most needs for loop unrolling, get rid of most possible performance hangs, but wouldn't actually save the whales. Gosh, darn.

    OTOH doing it would take some work and thought, and I am definitely not going to undertake the project any time soon.

Re: Perl's regex engine causes me distress
by japhy (Canon) on Sep 23, 2000 at 07:12 UTC
    Another possible optimization:
    m{ A (?: [^AB]+ | .B )* [^AB]* A }x;

Re: Perl's regex engine causes me distress
by demerphq (Chancellor) on Jan 08, 2007 at 21:45 UTC

    This is sometimes called auto-possesification in regex literature. Ive been looking into implementing it, but its not straightforward to do. Unfortunately. PCRE does it to some extent, but not Perl.


      by the way the possessive notation +<quantifier> (saw it first in the second edition of the owl's book) is cute actually: "E++" reads well, take all you can (of these E) don't backtrack.

      cheers --stephan

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://33542]
Approved by root
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (2)
As of 2023-09-29 13:00 GMT
Find Nodes?
    Voting Booth?

    No recent polls found