Inefficient regex

Wibble has asked for the wisdom of the Perl Monks concerning the following question:

The content below is stored in a string and parsed with the regular expression following it:

$str = <<EOFSTR;
content
content
<%start_wibble XXX%>
    content
    content
    <%start_flub YYY%>
        content
        content
        content
    <%end_flub%>
    content
    content
<%end_wibble%>
content
content
EOFSTR
[download]

$str =~ s/<%start_(.*?)\s*(.*?)\s*%>\n?(.+?)<%end_\1%>\n?/myfunc($2, $3)/segi;

The content is blocked with <%...%> tags, the entry of which is marked
with "<%start_LABEL PARAMETER%>" and the exit of a block is marked
with "<%end_LABEL%>". Whenever the regular expression encounters such
a block it executes myfunc passing in the PARAMETER and the content
enclosed by the block. Myfunc operates on the content passed in
according to the value of the PARAMETER passed.

The code above works fine, but the regular expression can take 30+
seconds when $str contains 500K of characters. Is this normal or
is there something inherently inefficient with my regular expression?
Is it my use of backreferences?

Comment on Inefficient regex Download Code

Replies are listed 'Best First'.
Re: Inefficient regex by sutch (Curate) on Mar 05, 2003 at 18:40 UTC
This: `(.?)\s(.?)\s` can be problematic. Since LABEL and PARAMETER must not contain spaces, try replacing that portion with something like: `([^\s]+)\s+([^\s]+)?` I'd also suggest that you change ? to simply , since * denotes 0 or more occurances. Probably the biggest problem is the: `(.+?)` which occurs just before: `<%end_\1%>` and matches to the end of the string, which then causes the regex engine to backtrack each time it cannot find each "end" tag. You may also want to take a look at the book "Mastering Regular Expressions" by Jeffrey E. Friedl. See Mastering Regular Expressions for a review.	[reply] [d/l] [select]
Re^2: Inefficient regex (death to dot star) by tye (Sage) on Mar 05, 2003 at 19:45 UTC
Exactly. In more detail, what the original regex ends up doing goes like this: <%start_(.?)\s(.?)\s%>\n?(.+?)<%end_\1%>\n? $1- -A- $2- -B- $3- C- Quickly matches <%start_ Tries $1 matching "" Tries A matching "" Tries $2 matching "wibble XXX" Tries $3 matches to just before "<%end_flub%>" Is forced to backtrack by C Tries $3 matches to just before "<%end_wibble%>" Is forced to backtrack by C Tries $1 matching "w" Tries A matching "" Tries $2 matching "ibble XXX" Tries $3 matches to just before "<%end_flub%>" Is forced to backtrack by C Tries $3 matches to just before "<%end_wibble%>" Is forced to backtrack by C Tries $1 matching "wi" Tries A matching "" Tries $2 matching "bble XXX" Tries $3 matches to just before "<%end_flub%>" Is forced to backtrack by C Tries $3 matches to just before "<%end_wibble%>" Is forced to backtrack by C ... Tries $1 matching "wibble" Tries A matching " " Tries $2 matching "XXX" Tries $3 matching to just before "<%end_flub%>" Is forced to backtrack by C Tries $3 matches to just before "<%end_wibble%>" Succeeds finding one match [download] which you can see wastes a lot of time. That is one reason why Death to dot star! suggests you use character class instead of . whenever possible in a regex. - tye	[reply] [d/l]
Re: Re^2: Inefficient regex (death to dot star) by Wibble (Beadle) on Mar 05, 2003 at 21:07 UTC
This is great and very clear - thank you!! I've taken on board what you said, experimented by making LABEL long then short to prove the point and sure enough execution times vary wildy. I've altered the regex to now use (\S+) to pick up the LABEL and it now completes in sub-second time. Thanks again!!!	[reply]
Re: Re: Inefficient regex by Thelonius (Priest) on Mar 05, 2003 at 19:40 UTC
Note:`\S+` is the same as `[^\s+]`	[reply] [d/l] [select]
Re: Re: Re: Inefficient regex by sutch (Curate) on Mar 05, 2003 at 20:14 UTC
Thelonius, your note is not correct: \s+ - matches one or more space characters [^\s+] - matches one character that is not a space nor a '+' [^\s]+ - (what you may have meant) matches one or more non-space characters.	[reply]
Re: Inefficient regex by dws (Chancellor) on Mar 05, 2003 at 19:07 UTC
I assume that you're recursively decomposing the string from with myfunc(). If that's the case, then consider what's happening as you process a large string. Each time you match and substitute, you're building a new large string out of pieces of the old large string. When you do several substituions on a 500K string, you'll get a lot of memory shuffling. An alternate approach is to break the input string into fragments, operate on the fragments, and then reassemble (i.e., join) an output string from fragments when you're finished. My favorite tools for breaking a large string into pieces are the \G assertion, coupled with /c. Find both in perlre.	[reply]
Re: Inefficient regex by diotalevi (Canon) on Mar 06, 2003 at 16:45 UTC
I'm immediately reminded of my answer in Re^2: REGEX different on Linux & Win32!. Its worth looking at as an alternate way of reading this sort of string. Seeking Green geeks in Minnesota	[reply]


Perl Monk, Perl Meditation
	PerlMonks