I fixed a quirky bug in the regex engine this morning, thanks to converter's test case. It was due to an over-active optimization. The optimization is "if the regex starts with .*, pretend it started with ^.*" and it makes sense -- since .* will have exhausted itself by the time the regex has failed, there's no sense in trying to match anywhere later in the string.

However, this happens even if the .* is being captured and referenced later in the regex (a back-reference). That's no good, as shown by the test case: "abc123bc" =~ /(.*)\d+\1/. We'd like $1 to be "bc", but Perl implicitly anchors this regex to the beginning of the string, so the match fails.
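For anyone who wants to check their own perl, here is a minimal sketch of that test case. The sub name captured_prefix is made up for illustration. On a perl with the over-active anchoring it should return undef; on one without it, it should return "bc".

```perl
use strict;
use warnings;

# Hypothetical helper: returns what $1 captured, or undef if the match fails.
sub captured_prefix {
    my ($string) = @_;
    return $string =~ /(.*)\d+\1/ ? $1 : undef;
}

my $got = captured_prefix("abc123bc");
print defined $got
    ? "matched, \$1 = '$got'\n"
    : "no match (regex was implicitly anchored)\n";
```

A string such as "bc123bc", where the repeated text already starts at position 0, matches either way, so it makes a handy control.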

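I've patched Perl, but in retrospect, the patch should also check for a back-reference before turning the optimization off: a captured .* that is never referenced later in the regex can still be anchored safely. That shouldn't be too bad, though.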

Ruby has this bug too. Both Perl and Ruby can side-step it with a little trick: put .{0} or (?=) as the first thing in your regex; that way, the dot-star isn't the first thing the regex engine sees. Really.
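A minimal sketch of the trick in Perl, applied to the test case from above. The hash name and its labels are made up for illustration; the two patterns are just the test-case regex with each zero-width prefix in front.

```perl
use strict;
use warnings;

my $string = "abc123bc";

# Hypothetical variants of the test-case pattern. Each starts with a
# zero-width construct, so (.*) is no longer the first thing in the regex.
my %workaround = (
    'dot-zero'        => qr/.{0}(.*)\d+\1/,
    'empty lookahead' => qr/(?=)(.*)\d+\1/,
);

for my $name (sort keys %workaround) {
    if ($string =~ $workaround{$name}) {
        print "$name: \$1 = '$1'\n";
    }
    else {
        print "$name: no match\n";
    }
}
```

Neither prefix consumes any characters, so the set of strings the regex matches should be unchanged.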

Python is free from this bug as of its latest release, 2.2.

Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Quirky regex bug
by Corion (Pope) on Jan 07, 2002 at 02:35 UTC

    Just to add a small diagnosis from the symptoms, without looking at the source code: the problem seems to be specific to *, as fixed-width REs work (at least with 5.6.1), for example:

    print $1 if 'abcfoobc' =~ /(.*)foo\1/;   # works
    print $1 if 'abcfoobc' =~ /(.*).{3}\1/;  # does work, $1 empty
    print $1 if 'abcfoobc' =~ /(.+)foo\1/;   # works as well

    Update: Thanks to chipmunk, I figured out that I was dumb. The following RE also matches:

    print $1 if 'abcfoobc' =~ /(.*)\w*\1/; # does work

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
    $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
    ($c = $d->accept())->get_request(); $c->send_response( new #in the
    HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
      Yes. From the source (regcomp.c:1851):
      else if (!sawopen && (OP(first) == STAR &&
          PL_regkind[(U8)OP(NEXTOPER(first))] == REG_ANY) &&
          !(r->reganch & ROPT_ANCH) )
      {
          /* turn .* into ^.* with an implied $*=1 */
          /* ... */
      }
      So you can see it is only for regexes starting with .* -- if you look a few lines up, you see:
      /* Skip introductions and multiplicators >= 1. */
      while ((OP(first) == OPEN && (sawopen = 1)) ||
             /* ... */ )
      {
          /* ... */
      }
      So it's ignoring the fact that the .* might actually be inside parens.

      My patch consisted of adding the text "!sawopen &&" to the condition in the first code block above. Big patch, I know.
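To make the effect of that one extra test easier to see, here is a toy model of the decision in Perl. It is only a sketch, not the regcomp.c logic itself: the sub names and hash keys are hypothetical, and the third sub reflects my reading of the "check for a back-reference too" idea from the root post.

```perl
use strict;
use warnings;

# Toy model only. Keys (all hypothetical):
#   dotstar_first - the first real node of the regex is .*
#   open_paren    - an opening paren was skipped before reaching it
#   backref       - the regex contains a back-reference

sub anchors_before_patch {
    my (%seen) = @_;
    return $seen{dotstar_first} ? 1 : 0;
}

sub anchors_after_patch {
    my (%seen) = @_;
    return ($seen{dotstar_first} && !$seen{open_paren}) ? 1 : 0;
}

sub anchors_with_backref_check {
    my (%seen) = @_;
    return ($seen{dotstar_first} && !($seen{open_paren} && $seen{backref})) ? 1 : 0;
}

# Flags for the test case /(.*)\d+\1/
my %test_case = (dotstar_first => 1, open_paren => 1, backref => 1);

printf "before patch: %d, after patch: %d, with backref check: %d\n",
    anchors_before_patch(%test_case),
    anchors_after_patch(%test_case),
    anchors_with_backref_check(%test_case);
```

In this model a regex like /(.*)foo/, which captures but never back-references, loses the optimization under the one-line patch and gets it back once the back-reference check is added.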

      Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;