Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Perl's regex engine causes me distress

by japhy (Canon)
on Sep 21, 2000 at 23:04 UTC ( [id://33542]=perlquestion: print w/replies, xml ) Need Help??

japhy has asked for the wisdom of the Perl Monks concerning the following question:

Here is my string: "A1234567890A". Here is my regex:
m{ A # A (?: [^AB]* # 0 or more non-A and non-B characters . B # any character, then a B )* # this combo, 0 or more times [^AB]* # 0 or more non-A and non-B characters A # an A }x
Now, before you ask, this is related to a) unrolling the loop, and b) reversing a regex. But it's also a very isolated case of the regex engine being a ninny.

This what I personally think the regex engine should do:
BEFORE & AFTER REGEX <> <A01234567890A> A <A> <01234567890A> [^AB]* <A01234567890> <A> . <A01234567890A> <> B FAILED <A0123456789> <0A> . <A01234567890> <A> B FAILED
At this point, Perl should NOT try to do:
<A012345678> <90A> . <A0123456789> <0A> B FAILED
since Perl should KNOW that '0' was matched by [^AB]*, so it can't POSSIBLY match B. Instead, Perl should realize it should give up, and continue:
<A01234567890> <A> [^AB]* <A01234567890> <A> A <A01234567890A> <> FINISHED
This is NOT the case. Perl zips ALL the way back to the first 0 in the string, trying to match .B until it is exhausted, and goes back to the '...890' having been matched by [^AB]*, and it goes to the [^AB]* outside the (?:...)*. This matches nothing, and then the 'A' matches.

My gripe is that Perl should know that if something COULD be matched by [^AB]*, then it CAN'T match 'B'.

$_="goto+F.print+chop;\n=yhpaj";F1:eval

Replies are listed 'Best First'.
Re: Perl's regex engine causes me distress
by japhy (Canon) on Sep 21, 2000 at 23:21 UTC
    As for how I got around Perl's unhappiness here, I rolled my own solution:
    m{ A (?: (?>[^AB]*) # match non-A's and non-B's WITHOUT BACKTRACKING (?: [AB] # match A or B next... | # ...or... (?<=.) # ...if there was a character before ) B # match B )* # 0 or more times [^AB]* # you know the rest A }x
    Ick. Why should I have to do that work?

    $_="goto+F.print+chop;\n=yhpaj";F1:eval
Re (tilly) 1: Perl's regex engine causes me distress
by tilly (Archbishop) on Sep 21, 2000 at 23:14 UTC
    I know of no RE engine that does it, but the idea I outlined in this conversation would enable the above optimization, obviate most needs for loop unrolling, get rid of most possible performance hangs, but wouldn't actually save the whales. Gosh, darn.

    OTOH doing it would take some work and thought, and I am definitely not going to undertake the project any time soon.

Re: Perl's regex engine causes me distress
by japhy (Canon) on Sep 23, 2000 at 07:12 UTC
    Another possible optimization:
    m{ A (?: [^AB]+ | .B )* [^AB]* A }x;


    $_="goto+F.print+chop;\n=yhpaj";F1:eval
Re: Perl's regex engine causes me distress
by demerphq (Chancellor) on Jan 08, 2007 at 21:45 UTC

    This is sometimes called auto-possesification in regex literature. Ive been looking into implementing it, but its not straightforward to do. Unfortunately. PCRE does it to some extent, but not Perl.

    ---
    $world=~s/war/peace/g

      by the way the possessive notation +<quantifier> (saw it first in the second edition of the owl's book) is cute actually: "E++" reads well, take all you can (of these E) don't backtrack.

      cheers --stephan

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://33542]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (4)
As of 2025-03-24 13:03 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    When you first encountered Perl, which feature amazed you the most?










    Results (64 votes). Check out past polls.

    Notices?
    erzuuliAnonymous Monks are no longer allowed to use Super Search, due to an excessive use of this resource by robots.