PerlMonks  

Quirky regex bug

by japhy (Canon)
on Jan 07, 2002 at 02:28 UTC (#136718=perlmeditation)

I fixed a quirky bug in the regex engine this morning, thanks to converter's test case. It was due to an over-active optimization. The optimization is "if the regex starts with .*, pretend it started with ^.*" and it makes sense -- since .* will have exhausted itself by the time the regex has failed, there's no sense in trying to match anywhere later in the string.
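
To make the reasoning concrete, here is a minimal sketch of why the implicit anchor is normally harmless. The sample strings and the pattern /.*\d/ are made up for illustration and are not taken from the Perl source. On single-line strings with no backreference involved, the plain and explicitly anchored forms should agree:

```perl
use strict;
use warnings;

# Hypothetical sample strings; all single-line, so "." can reach every position.
my @samples = ('abc123', 'no digits here', '7', '');

my $all_same = 1;
for my $s (@samples) {
    # Record the outcome of each form of the pattern as a comparable string.
    my $plain    = $s =~ /.*\d/  ? "matched '$&'" : 'no match';
    my $anchored = $s =~ /^.*\d/ ? "matched '$&'" : 'no match';

    $all_same &&= ($plain eq $anchored);
    printf "%-18s plain: %-18s anchored: %s\n", "'$s'", $plain, $anchored;
}

print $all_same ? "anchoring changed nothing\n" : "anchoring changed a result\n";
```

If the first attempt at offset 0 fails, the .* has already tried every possible length, so a later starting offset could not succeed either. That is the whole justification for the shortcut, and it only holds while nothing later in the regex depends on where the .* began.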

However, this happens even if the .* is being captured and referenced later in the regex (a back-reference). That's no good, as shown by the test case: "abc123bc" =~ /(.*)\d+\1/;
We'd like $1 to be "bc", but Perl implicitly anchors this regex to the beginning of the string, and thus fails.
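
As a rough illustration of the failure (a sketch, assuming a perl that already includes the fix), the explicitly anchored pattern below mimics what the over-eager optimization did to the unanchored one. The variable names are mine:

```perl
use strict;
use warnings;

my $str = 'abc123bc';

# Unanchored: the engine is free to start the match at offset 1.
my ($free) = $str =~ /(.*)\d+\1/;

# Explicitly anchored: a stand-in for what the implicit-anchor optimization did.
my ($pinned) = $str =~ /^(.*)\d+\1/;

print 'unanchored: ', defined $free   ? "\$1 = '$free'"   : 'no match', "\n";
print 'anchored:   ', defined $pinned ? "\$1 = '$pinned'" : 'no match', "\n";
```

On a fixed perl the unanchored form should capture "bc" while the anchored form finds nothing. On an affected perl both would fail, which is the bug.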

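I've patched Perl, but in retrospect the patch is broader than it needs to be: it should turn the optimization off only when the regex also contains a backreference to the captured .*. That shouldn't be too hard to add, though.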

Ruby has this bug too. Both can side-step it with a little trick: put .{0} or (?=) as the first thing in your regex; that way, the dot-star isn't the first thing the regex engine sees. Really.
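
Here is a small sketch of the two workarounds, using the same test string as above. Both prefixes match the empty string, so they should not change what the pattern means. On a perl that already has the fix they behave just like the plain pattern, so this only shows that the trick is harmless, not that it is needed there:

```perl
use strict;
use warnings;

my $str = 'abc123bc';

# Zero-width prefixes: each matches the empty string, so the pattern's meaning
# is unchanged, but the regex no longer begins with .* from the optimizer's view.
my ($via_quant) = $str =~ /.{0}(.*)\d+\1/;
my ($via_look)  = $str =~ /(?=)(.*)\d+\1/;

for my $case (['.{0}', $via_quant], ['(?=)', $via_look]) {
    my ($prefix, $captured) = @$case;
    print "$prefix prefix: ",
        defined $captured ? "\$1 = '$captured'" : 'no match', "\n";
}
```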

Python is free from this bug as of its latest release, 2.2.

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Quirky regex bug
by Corion (Pope) on Jan 07, 2002 at 02:35 UTC

    Just to add a small diagnosis from symptoms, without looking at the source code: the problem seems to be specific to *, as fixed-width REs work (at least with 5.6.1), like:

    print $1 if 'abcfoobc' =~ /(.*)foo\1/;   # works
    print $1 if 'abcfoobc' =~ /(.*).{3}\1/;  # does work, $1 empty
    print $1 if 'abcfoobc' =~ /(.+)foo\1/;   # works as well

    Update: Thanks to chipmunk, I figured out that I was dumb. The following RE also matches:

    print $1 if 'abcfoobc' =~ /(.*)\w*\1/; # does work

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
    $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
    ($c = $d->accept())->get_request(); $c->send_response( new #in the
    HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
      Yes. From the source (regcomp.c:1851):
      else if (!sawopen && (OP(first) == STAR &&
          PL_regkind[(U8)OP(NEXTOPER(first))] == REG_ANY) &&
          !(r->reganch & ROPT_ANCH) )
      {
          /* turn .* into ^.* with an implied $*=1 */
          /* ... */
      }
      So you can see it is only for regexes starting with .* -- if you look a few lines up, you see:
      /* Skip introductions and multiplicators >= 1. */
      while ((OP(first) == OPEN && (sawopen = 1)) ||
          /* ... */ )
      {
          /* ... */
      }
      So it's ignoring the fact that the .* might actually be inside parens.

      My patch consisted of the text "!sawopen &&" in the code block up top. Big patch, I know.

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
