Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re: Inefficient regex

by sutch (Curate)
on Mar 05, 2003 at 18:40 UTC ( #240652=note: print w/ replies, xml ) Need Help??


in reply to Inefficient regex

This:
(.*?)\s*(.*?)\s*
can be problematic. Since LABEL and PARAMETER must not contain spaces, try replacing that portion with something like:
([^\s]+)\s+([^\s]+)?

I'd also suggest that you change *? to simply *, since * denotes 0 or more occurances.

Probably the biggest problem is the:
(.+?)
which occurs just before:
<%end_\1%>
and matches to the end of the string, which then causes the regex engine to backtrack each time it cannot find each "end" tag.

You may also want to take a look at the book "Mastering Regular Expressions" by Jeffrey E. Friedl. See Mastering Regular Expressions for a review.


Comment on Re: Inefficient regex
Select or Download Code
Re: Re: Inefficient regex
by Thelonius (Curate) on Mar 05, 2003 at 19:40 UTC
    Note:\S+ is the same as [^\s+]
      Thelonius, your note is not correct:

      \s+ - matches one or more space characters
      [^\s+] - matches one character that is not a space nor a '+'
      [^\s]+ - (what you may have meant) matches one or more non-space characters.

Re^2: Inefficient regex (death to dot star)
by tye (Cardinal) on Mar 05, 2003 at 19:45 UTC

    Exactly. In more detail, what the original regex ends up doing goes like this:

    <%start_(.*?)\s*(.*?)\s*%>\n?(.+?)<%end_\1%>\n? $1- -A- $2- -B- $3- C- Quickly matches <%start_ Tries $1 matching "" Tries A matching "" Tries $2 matching "wibble XXX" Tries $3 matches to just before "<%end_flub%>" Is forced to backtrack by C Tries $3 matches to just before "<%end_wibble%>" Is forced to backtrack by C Tries $1 matching "w" Tries A matching "" Tries $2 matching "ibble XXX" Tries $3 matches to just before "<%end_flub%>" Is forced to backtrack by C Tries $3 matches to just before "<%end_wibble%>" Is forced to backtrack by C Tries $1 matching "wi" Tries A matching "" Tries $2 matching "bble XXX" Tries $3 matches to just before "<%end_flub%>" Is forced to backtrack by C Tries $3 matches to just before "<%end_wibble%>" Is forced to backtrack by C ... Tries $1 matching "wibble" Tries A matching " " Tries $2 matching "XXX" Tries $3 matching to just before "<%end_flub%>" Is forced to backtrack by C Tries $3 matches to just before "<%end_wibble%>" Succeeds finding one match
    which you can see wastes a lot of time.

    That is one reason why Death to dot star! suggests you use character class instead of . whenever possible in a regex.

                    - tye
      This is great and very clear - thank you!! I've taken on board what you said, experimented by making LABEL long then short to prove the point and sure enough execution times vary wildy. I've altered the regex to now use (\S+) to pick up the LABEL and it now completes in sub-second time. Thanks again!!!

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://240652]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (8)
As of 2014-07-13 16:49 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    When choosing user names for websites, I prefer to use:








    Results (250 votes), past polls