Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Interesting behavior of regular expression engine

by lightoverhead (Monk)
on Mar 12, 2013 at 22:55 UTC ( #1023067=perlquestion: print w/ replies, xml ) Need Help??
lightoverhead has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks:

I found an interesting case of regex. This is actually from Programing Perl 4th errata.
The output of the following line of code is different than what is des +cribed in the text: "abcdef" =~ / .+ (?{say "Matched so far: $&"}) bcdef $/x; The text states that this prints the following: Matched so far: abcdef Matched so far: abcde Matched so far: abcd Matched so far: abc Matched so far: ab Matched so far: a But I am finding that only the last line gets printed: Matched so far: a One solution to get the output that the text describes is to slightly +modify the code example to the following: "abcdef" =~ / .+ (?{say "Matched so far: $&"}) .cdef $/x;
I am wondering why "backtrack" doesn't behave the way originally expected and why the edited version works?

Thank you!

Comment on Interesting behavior of regular expression engine
Download Code
Re: Interesting behavior of regular expression engine
by Anonymous Monk on Mar 12, 2013 at 23:14 UTC

    I am wondering why "backtrack" doesn't behave the way originally expected and why the edited version works?

    Its something the engine does, its a sideeffect, its not something to depend on

Re: Interesting behavior of regular expression engine
by dave_the_m (Parson) on Mar 12, 2013 at 23:16 UTC
    Because the first one is triggering an optimisation. If you do '.+b', you might expect the engine to naively just keep skipping non-newline chars (.) until it can't go any further, then fail to match a 'b', backtrack one position, still fail the 'b', then continue backtracking until only one '.' is consumed, and the following 'b' is matched.

    Instead, the code that does '.+' is optimised to know that it must be followed by a known fixed string (bcdef) and avoids consuming so many characters that the constraint can't be met.

    In the second case it isn't followed by a fixed string, but rather by a '.', so the optimisation isn't triggered, and the engine falls back to a naive series of backtracks.

    Dave.

      Thank you Dave. Your answer is very clear!
Re: Interesting behavior of regular expression engine
by kcott (Abbot) on Mar 12, 2013 at 23:33 UTC

    G'day lightoverhead,

    I ran that code and got the same result as you:

    $ perl -Mstrict -Mwarnings -E ' "abcdef" =~ / .+ (?{say "Matched so far: $&"}) bcdef $/x; ' Matched so far: a

    I then stepped through the code using Regexp::Debugger:

    $ perl -Mstrict -Mwarnings -E ' use Regexp::Debugger; "abcdef" =~ / .+ (?{say "Matched so far: $&"}) bcdef $/x; '

    This showed each of the six "Matched so far: ..." lines in the same order as you have them. It also showed backtracking behaviour as I, and presumably you, would expect.

    The problem seems to lie with just the output; however, I don't know what that problem is.

    -- Ken

Re: Interesting behavior of regular expression engine
by tobyink (Abbot) on Mar 12, 2013 at 23:35 UTC

    I'm no expert, but use re 'debug' seems to offer some insight.

    In both cases the last part of the regexp is the longest floating string, so is the part that Perl attempts to match first.

    In the first case, all that remains is to match "a" against /.+/. This succeeds, and your code returns true (say returns true if it is able to print anything), thus the whole match succeeds. The code only needs to be executed once.

    In the second case, it tries to slot letters into two patterns: /.+/ before the code and /./ after the code, and it has to shuffle them around six times before it gets a success.

    Personally, I think it's a bad idea to rely on code embedded in regexps having side-effects. How many times it will be executed, and in what order is pretty unpredictable; a new Perl release could change it.

    package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name

      I have to agree with tobyink on this one. Regexes are supposed to provide stable output given a set of well-defined rules (which thanks to p5p they pretty much always have, without fail). What's not well-defined is what steps the underlying implementation takes to arrive at the output, and that's a Good Thing, as some of the optimizations this enables are quite profound indeed. I don't get the sense that you (the OP) are advocating this weird execution side effect not working as being a bug that needs to be fixed, but nonetheless feel it important to underscore my own strong preference for end-result correctness first, performance second, and weird effects likely to break tomorrow coming in a distant third.

        What's not well-defined is what steps the underlying implementation takes to arrive at the output,

        Yet, "Programming Perl 3rd" lays out 6 very complicated rules describing in detail how the regex engine proceeds (p 197-201). So, Larry at least thinks the steps are/were well defined. I wonder if those steps are in the new edition?
      In both cases the last part of the regexp is the longest floating string, so is the part that Perl attempts to match first.

      I've never seen that be the case and I seriously doubt that it was the case in your test run.

      The only thing I've seen the regex engine do with the "longest floating string" is to use only the length it to estimate the offset where it will begin searching for a match (going left-to-right in the string and left-to-right in the regex).

      - tye        

Re: Interesting behavior of regular expression engine
by Don Coyote (Monk) on Mar 13, 2013 at 08:54 UTC

    I am interested in the use of say within what you describe as a Perl 4th errata construct. From recent discussions on perlmonks Unexpected behavior of function 'say' we have seen that early say was implemented as a method in IO::File through IO::Handle. These modules were introduced to core in Perl5.3.7.

    I am fairly sure I have not seen say used within the Perl4 text I have read. Which text is your example extracted from? And, could the use of say, and/or the implementation of $& be having any affect, aside from backtracking optimisations, on the output?

    oh, errata as in edition, not version

    apologies, please reap...

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1023067]
Approved by kcott
Front-paged by kcott
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others studying the Monastery: (6)
As of 2014-10-31 11:35 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    For retirement, I am banking on:










    Results (216 votes), past polls