Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Inefficient regex

by Wibble (Beadle)
on Mar 05, 2003 at 18:17 UTC ( [id://240648]=perlquestion: print w/replies, xml ) Need Help??

Wibble has asked for the wisdom of the Perl Monks concerning the following question:

The content below is stored in a string and parsed with the regular expression following it:
$str = <<EOFSTR; content content <%start_wibble XXX%> content content <%start_flub YYY%> content content content <%end_flub%> content content <%end_wibble%> content content EOFSTR

$str =~ s/<%start_(.*?)\s*(.*?)\s*%>\n?(.+?)<%end_\1%>\n?/myfunc($2, $3)/segi;

The content is blocked with <%...%> tags, the entry of which is marked
with "<%start_LABEL PARAMETER%>" and the exit of a block is marked
with "<%end_LABEL%>". Whenever the regular expression encounters such
a block it executes myfunc passing in the PARAMETER and the content
enclosed by the block. Myfunc operates on the content passed in
according to the value of the PARAMETER passed.

The code above works fine, but the regular expression can take 30+
seconds when $str contains 500K of characters. Is this normal or
is there something inherently inefficient with my regular expression?
Is it my use of backreferences?

Replies are listed 'Best First'.
Re: Inefficient regex
by sutch (Curate) on Mar 05, 2003 at 18:40 UTC
    This:
    (.*?)\s*(.*?)\s*
    can be problematic. Since LABEL and PARAMETER must not contain spaces, try replacing that portion with something like:
    ([^\s]+)\s+([^\s]+)?

    I'd also suggest that you change *? to simply *, since * denotes 0 or more occurances.

    Probably the biggest problem is the:
    (.+?)
    which occurs just before:
    <%end_\1%>
    and matches to the end of the string, which then causes the regex engine to backtrack each time it cannot find each "end" tag.

    You may also want to take a look at the book "Mastering Regular Expressions" by Jeffrey E. Friedl. See Mastering Regular Expressions for a review.

      Exactly. In more detail, what the original regex ends up doing goes like this:

      <%start_(.*?)\s*(.*?)\s*%>\n?(.+?)<%end_\1%>\n? $1- -A- $2- -B- $3- C- Quickly matches <%start_ Tries $1 matching "" Tries A matching "" Tries $2 matching "wibble XXX" Tries $3 matches to just before "<%end_flub%>" Is forced to backtrack by C Tries $3 matches to just before "<%end_wibble%>" Is forced to backtrack by C Tries $1 matching "w" Tries A matching "" Tries $2 matching "ibble XXX" Tries $3 matches to just before "<%end_flub%>" Is forced to backtrack by C Tries $3 matches to just before "<%end_wibble%>" Is forced to backtrack by C Tries $1 matching "wi" Tries A matching "" Tries $2 matching "bble XXX" Tries $3 matches to just before "<%end_flub%>" Is forced to backtrack by C Tries $3 matches to just before "<%end_wibble%>" Is forced to backtrack by C ... Tries $1 matching "wibble" Tries A matching " " Tries $2 matching "XXX" Tries $3 matching to just before "<%end_flub%>" Is forced to backtrack by C Tries $3 matches to just before "<%end_wibble%>" Succeeds finding one match
      which you can see wastes a lot of time.

      That is one reason why Death to dot star! suggests you use character class instead of . whenever possible in a regex.

                      - tye
        This is great and very clear - thank you!! I've taken on board what you said, experimented by making LABEL long then short to prove the point and sure enough execution times vary wildy. I've altered the regex to now use (\S+) to pick up the LABEL and it now completes in sub-second time. Thanks again!!!
      Note:\S+ is the same as [^\s+]
        Thelonius, your note is not correct:

        \s+ - matches one or more space characters
        [^\s+] - matches one character that is not a space nor a '+'
        [^\s]+ - (what you may have meant) matches one or more non-space characters.

Re: Inefficient regex
by dws (Chancellor) on Mar 05, 2003 at 19:07 UTC
    I assume that you're recursively decomposing the string from with myfunc(). If that's the case, then consider what's happening as you process a large string. Each time you match and substitute, you're building a new large string out of pieces of the old large string. When you do several substituions on a 500K string, you'll get a lot of memory shuffling.

    An alternate approach is to break the input string into fragments, operate on the fragments, and then reassemble (i.e., join) an output string from fragments when you're finished.

    My favorite tools for breaking a large string into pieces are the \G assertion, coupled with /c. Find both in perlre.

Re: Inefficient regex
by diotalevi (Canon) on Mar 06, 2003 at 16:45 UTC

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://240648]
Approved by VSarkiss
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (7)
As of 2024-04-19 07:48 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found