
Quirky regex bug

by japhy (Canon)
on Jan 07, 2002 at 02:28 UTC ( #136718=perlmeditation )

I fixed a quirky bug in the regex engine this morning, thanks to converter's test case. It was due to an over-active optimization. The optimization is "if the regex starts with .*, pretend it started with ^.*" and it makes sense -- since .* will have exhausted itself by the time the regex has failed, there's no sense in trying to match anywhere later in the string.
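To illustrate why the optimization is harmless when no back-reference is involved, here is a minimal sketch that compares a pattern starting with .* against the same pattern with the anchor written out. The helper name same_outcome and the sample strings are made up for illustration, and the /m flag is an assumption meant to mirror the "implied $*=1" mentioned in the source comment quoted further down the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if /.*TAIL/ and /^.*TAIL/m agree on
# whether $str matches, 0 otherwise.
sub same_outcome {
    my ($str, $tail) = @_;
    my $plain    = $str =~ /.*$tail/   ? 1 : 0;   # pattern as written
    my $anchored = $str =~ /^.*$tail/m ? 1 : 0;   # anchor made explicit
    return $plain == $anchored ? 1 : 0;
}

# Illustrative inputs; none of them uses a back-reference.
my @cases = (
    [ "abc123",           '\d+' ],
    [ "abc",              '\d+' ],
    [ "line one\nline 2", '\d'  ],
    [ "",                 'x'   ],
);

for my $case (@cases) {
    my ($str, $tail) = @$case;
    printf "%-6s %s\n", $tail, same_outcome($str, $tail) ? "agree" : "differ";
}
```

For patterns like these the two forms should always agree, which is what makes the implicit anchor a safe shortcut.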

However, this happens even if the .* is being captured and referenced later in the regex (a back-reference). That's no good, as shown by the test case: "abc123bc" =~ /(.*)\d+\1/;
We'd like $1 to be "bc", but Perl implicitly anchors this regex to the beginning of the string, and thus fails.
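The effect can be reproduced on any perl by writing the anchor explicitly, which is roughly what the optimization does behind the scenes. This is only a sketch: first_capture is a hypothetical helper, and the /m flag is an assumption about how the implicit anchor behaves.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first capture group, or undef if the
# pattern does not match.
sub first_capture {
    my ($str, $re) = @_;
    return $str =~ $re ? $1 : undef;
}

my $str = "abc123bc";

# The test case as written. A perl with the fix is free to start the
# match at offset 1, so that $1 can be "bc".
my $unanchored = first_capture($str, qr/(.*)\d+\1/);

# The same pattern with the anchor written out, simulating the
# over-eager optimization.
my $anchored = first_capture($str, qr/^(.*)\d+\1/m);

print "unanchored: ", defined $unanchored ? "'$unanchored'" : "no match", "\n";
print "anchored:   ", defined $anchored   ? "'$anchored'"   : "no match", "\n";
```

On a perl that includes the fix, the unanchored form should report 'bc' while the explicitly anchored form should report no match; on an affected perl both would fail.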

I've patched Perl, but in retrospect, I should make sure I find a backreference too. That shouldn't be too bad, though.

Ruby has this bug too. Both can side-step it with a little trick: put .{0} or (?=) as the first thing in your regex; that way, the dot-star isn't the first thing the regex engine sees. Really.
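A minimal sketch of the two workarounds applied to the test case above. The helper name capture_with is made up for illustration; the patterns themselves only use .{0} and (?=) as described.

```perl
use strict;
use warnings;

# Hypothetical helper: returns $1 for the given pattern, or undef if the
# pattern does not match.
sub capture_with {
    my ($str, $re) = @_;
    return $str =~ $re ? $1 : undef;
}

my $str = "abc123bc";

# Both prefixes match the empty string, so they do not change what the
# regex accepts; they only keep .* from being the first thing in it.
my %workaround = (
    'empty lookahead' => qr/(?=)(.*)\d+\1/,
    'zero-width dot'  => qr/.{0}(.*)\d+\1/,
);

for my $name (sort keys %workaround) {
    my $got = capture_with($str, $workaround{$name});
    printf "%-16s %s\n", $name, defined $got ? "'$got'" : "no match";
}
```

Both variants should capture 'bc', with or without the fix.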

Python is free from this bug as of its latest release, 2.2.

Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Replies are listed 'Best First'.
Re: Quirky regex bug
by Corion (Pope) on Jan 07, 2002 at 02:35 UTC

    Just to add a small diagnosis from the symptoms, without looking at the source code: the problem seems to be specific to *, as fixed-width REs work (at least with 5.6.1), like:

    print $1 if 'abcfoobc' =~ /(.*)foo\1/;   # works
    print $1 if 'abcfoobc' =~ /(.*).{3}\1/;  # does work, $1 empty
    print $1 if 'abcfoobc' =~ /(.+)foo\1/;   # works as well

    Update: Thanks to chipmunk, I figured out that I was dumb. The following RE also matches:

    print $1 if 'abcfoobc' =~ /(.*)\w*\1/; # does work

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
    $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
    ($c = $d->accept())->get_request(); $c->send_response( new #in the
    HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
      Yes. From the source (regcomp.c:1851):
      else if (!sawopen && (OP(first) == STAR &&
          PL_regkind[(U8)OP(NEXTOPER(first))] == REG_ANY) &&
          !(r->reganch & ROPT_ANCH) )
      {
          /* turn .* into ^.* with an implied $*=1 */
          /* ... */
      }
      So you can see it is only for regexes starting with .* -- if you look a few lines up, you see:
      /* Skip introductions and multiplicators >= 1. */
      while ((OP(first) == OPEN && (sawopen = 1)) ||
             /* ... */ )
      {
          /* ... */
      }
      So it's ignoring the fact that the .* might actually be inside parens.
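The decision described above can be modeled with a toy sketch. This is not the real regcomp.c logic: the compiled regex is represented as a flat list of op names, only leading OPEN ops are skipped (the real loop skips more), and the function name implicitly_anchored plus the op names other than OPEN, STAR and REG_ANY are illustrative assumptions.

```perl
use strict;
use warnings;

# Toy model: decide whether a regex, given as a flat list of op names,
# would get the implicit anchor. $patched selects whether the
# "!sawopen" test from the excerpt is applied.
sub implicitly_anchored {
    my ($patched, @ops) = @_;
    my $sawopen = 0;

    # Skip leading OPEN ops, remembering that at least one was seen.
    while (@ops && $ops[0] eq 'OPEN') {
        $sawopen = 1;
        shift @ops;
    }

    # Does what remains start with .* (STAR applied to REG_ANY)?
    my $dot_star = @ops >= 2 && $ops[0] eq 'STAR' && $ops[1] eq 'REG_ANY';
    return 0 unless $dot_star;

    # With the patch, a leading paren disables the optimization.
    return 0 if $patched && $sawopen;
    return 1;
}

# Rough op lists for /(.*)\d+\1/ and /.*foo/ (illustrative only).
my @with_parens = qw(OPEN STAR REG_ANY CLOSE PLUS DIGIT REF);
my @no_parens   = qw(STAR REG_ANY EXACT);

print "unpatched, parens:  ", implicitly_anchored(0, @with_parens), "\n";
print "patched, parens:    ", implicitly_anchored(1, @with_parens), "\n";
print "patched, no parens: ", implicitly_anchored(1, @no_parens),   "\n";
```

In this model the patched check still anchors a bare leading .* but leaves a parenthesized one alone, whether or not a back-reference follows.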

      My patch consisted of the text "!sawopen &&" in the first code block above. Big patch, I know.

      Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
