Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

This regex seems to have splattered non-greedy everywhere

by fizbin (Chaplain)
on Aug 10, 2005 at 17:30 UTC ( #482665=perlquestion: print w/ replies, xml ) Need Help??
fizbin has asked for the wisdom of the Perl Monks concerning the following question:

So I was trying to debug a problem I was having with a regular expression in that other language I use professionally and thought "well, let me see if perl's regular expression engine behaves the way I think it should on this". So then I copy it over and, while perl doesn't have the majorly unpleasant O(2**n) time behavior I was seeing with java.util.regex, it does misbehave in a bizarre fashion. Or at least, I think it does. I've narrowed down the test case as much as I can to still see the strange behavior.

The paired-down code attempts to split incoming lines on XX, but only when XX isn't inside a single-quoted string:

#!perl while (<DATA>) { chomp; print "for [$_]:\n"; # split on XX, but only when it's not in # single quotes. while (m% ((?:[^']*? # Some unquoted stuff (?:'[^']*')? # optional quoted stuff )*?) # many times, but non-greedy XX | # field separator, or (.+) %xg) # a non-blank final field { # field value should now be in defined($2)?$2:$1 my ($t1, $t2) = qw(undef undef); $t1 = "[$1]" if defined($1); $t2 = "[$2]" if defined($2); print "\t$t1\t$t2\n"; } } __END__ aXXbXXc abcd a little 'quote XX' quote stuff XX other
Now here's where it gets weird. The trailing fields are coming out one character at a time. Here's the output:
for [aXXbXXc]: [a] undef [b] undef undef [c] for [abcd]: undef [a] undef [b] undef [c] undef [d] for [a little 'quote XX' quote stuff XX other]: [a little 'quote XX' quote stuff ] undef undef [ ] undef [o] undef [t] undef [h] undef [e] undef [r]

It's as though perl had decided that my last + sign in the regular expression should be non-greedy despite the fact that it's not followed by a ?.

What's going on here? It gets even more bizarre:

If I replace the first half with a pattern that really should be equivalent, this behavior goes away:

#!perl while (<DATA>) { chomp; print "for [$_]:\n"; # split on XX, but only when it's not in # single quotes. while (m% ([^']*? # Some unquoted stuff (?:'[^']*' # optionally, quoted followed by [^']*?)*? )# unquoted stuff. non-greedy XX | # field separator, or (.+) %xg) # a non-blank final field { # field value should now be in defined($2)?$2:$1 my ($t1, $t2) = qw(undef undef); $t1 = "[$1]" if defined($1); $t2 = "[$2]" if defined($2); print "\t$t1\t$t2\n"; } } __END__ aXXbXXc abcd a little 'quote XX' quote stuff XX other

To summarize: I have a regular expression match that is of the form m%(foo)XX|(.+)%g, where foo is a slightly complicated expression with no captures. When I run it, I get single character results repeatedly in $2. When I replace foo with a different complicated expression that should be equivalent, I suddenly get multiple characters in $2.

I've verified this behavior with cygwin's 5.8.6 perl and ActiveState's 5.6.1. (build 635)

-- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/

Comment on This regex seems to have splattered non-greedy everywhere
Select or Download Code
Re: This regex seems to have splattered non-greedy everywhere
by pbeckingham (Parson) on Aug 10, 2005 at 17:45 UTC

    It can be easily made to work if you first take out the quoted stuff. Bear in mind that this (and your) code does not handle quoted text that spans lines.

    #! /usr/bin/perl use strict; use warnings; while (<DATA>) { chomp; s/'[^']+'//g; print $_, "\n" for split /XX/; } __END__ aXXbXXc abcd a little 'quote XX' quote stuff XX other



    pbeckingham - typist, perishable vertebrate.
Re: This regex seems to have splattered non-greedy everywhere
by japhy (Canon) on Aug 10, 2005 at 17:46 UTC
    It is a bug, but I'm not sure what's causing it. Running it through the debugger, when it finally gets to the (.+) part, it seems to be under the impression that it can only match one character. Changing the first half of the regex fixes this problem, but the fact is the bug remains.

    Update: By the way, I came up with a way to use split() for this kind of problem -- splitting on a pattern unless you're inside quotes (or the like). It's ugly, and probably not fit for production, but here it is:

    while (<DATA>) { chomp; my $q = 0; # documented for the faint of heart my @fields = split m{ ' # if we match a ' (?{ $q = !$q }) # toggle $q (?!) # and fail (don't split here) | # OR XX # if we match XX (?(?{$q}) # and $q is true (?!) # fail (don't split here) ) # otherwise it succeeds and splits }x; print "[", join("][", @fields), "]\n"; }
    Try that on your data. It's sick.

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
      That's some impressive abuse of code-eval-during-match, but sadly you're right in that it's not sutiable for production, especially when the real production system is sadly using regular expressions in that other language, and not perl.

      And I'll note that the other language returned the matches I thought it should, although it took time on the order of 2**(size of trailing field). It started getting noticeable when we hit a case where the trailing field was 26 character long...

      -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
        Bug aside, your regex could use some refactoring. I'd suggest:
        m{ ( [^']*? (?: '[^']*' [^']*? )* ) XX | (.+) $ }x
        The unrolled loop of the first half and the $ anchor of the second half should control the performance.

        Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
        How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re: This regex seems to have splattered non-greedy everywhere
by hv (Parson) on Aug 11, 2005 at 01:30 UTC

    The core of the first regexp looks like this:

    ( [^']* (stuff)? )*

    Because 'stuff' is optional, this becomes an exponential search: /( [^']* )*/x can match n characters 2n ways (as long as none is a quote character).*

    The alternate regexp looks like this:

    [^']* ( stuff [^']* )*
    and now, because 'stuff' is not optional, the search is no longer exponential.

    This is the cause of the slowdown with the first version in your non-perl code. The same slowdown doesn't happen in perl because the regexp engine has some special tricksy handling for this situation, and if it sees something looking exponential it turns on some caching behaviour to avoid unnecessary repetition. (If you use re qw/debug/; you'll see this:

    Detected a super-linear match, switching on caching...
    in the output when it happens.)

    While I am not currently aware of bugs in that area, I kinda expect that there are some, and I suspect that's what is biting you here; the fact that the second regexp does not suffer the same problem is strong evidence for this. (I can confirm that the behaviour persists in the latest bleadperl.)

    I'd suggest a) sending a perlbug about it, and b) trying real hard to remove exponential behaviour from regexps when you can.

    Hugo

    *Update: to be more strictly correct, /( [^']+ )*/x can match 2n ways, while /( [^']* )*/x can match infintely many ways. But the regexp engine automatically suppresses irrelevant multiple zero-length matches, so the effect is much the same.

      b) trying real hard to remove exponential behaviour from regexps when you can.
      Yes, well, there's the problem of recognizing this before it bites me in the ass in production. I mean, I guess this particular exponential behavior might be something I'll recognize in the future, but is there some automated tool that can recognize when regular expressions have the potential for exponential behavior? (Or is this one of those problems that get as hard as the decideability problem?)
      -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/

        The general question is: given a regexp fragment like

        ( prefix (meat)* suffix )*
        is is possible for the prefix and suffix to be simultaneously (and repeatedly) empty? If so, then anything that matches /(meat){n}/ will be able to match at least 2^n different ways.

        I'm sure that at the least a subset of exponential regexps should be discernible. I imagine the first stage of such an approach would be to iteratively strip out anything that can be zero-width, in which case it would be rather harder to detect that eg:

        /( (foo)* (?=b) (bar)? )* baz )/x
        is linear. Ask me again when Perl6 has a full grammar implementation - it might be an interesting project to write such a detector targetting a pluggable regexp grammar.

        Note that one generic approach to fixing such regexps is to use the 'cut' operator (?> ...), but as far as I am aware that operator is available only in perl itself and the ever-faithful PCRE. For example:

        /( (?> [^']* ) ('[^']*')? )* /x
        avoids the exponential behaviour by disallowing backtracking into the [^']*.

        Hugo

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://482665]
Approved by pboin
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others imbibing at the Monastery: (10)
As of 2014-08-20 13:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (114 votes), past polls