lookahead / lookbehind vs other regex methods

by shemp (Deacon)
on Jan 07, 2004 at 22:00 UTC ( [id://319617]=perlquestion )

shemp has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, here for the zeroth time this year - finally.

I've started becoming a big fan of lookahead and lookbehind assertions in regexes, and I'm starting to try different ways to perform certain tasks. I'm wondering what people have to say regarding what's preferable in certain situations, or whether this question degenerates into a programming-style holy war.

So, one instance is where you want to remove a dash from a string if it's preceded by a number and followed by a letter (trivial example). Here are a couple of ways to do this:

    $string =~ s/(\d)-([A-Z])/$1$2/ig;        # my old way
    $string =~ s/(?<=\d)-(?=[A-Z])//ig;       # my new way

Unless I'm missing something, those lines should both accomplish the exact same thing. I'm really becoming a fan of the lookahead / lookbehind version; it seems like it should run faster than the $1$2 way, but I know virtually nothing of the underlying workings of the regex engine.

There are other similar situations where I am in limbo over using lookahead / lookbehind, but the trivial cases all boil down to the above (more or less).

So flame me, or offer $0.02 or whatever.

thanks,
sean

Replies are listed 'Best First'.
Re: lookahead / lookbehind vs other regex methods
by BrowserUk (Patriarch) on Jan 07, 2004 at 22:13 UTC

    Avoiding capturing is one of the keys to writing quicker regexes. Using (?:...) rather than (...) when you don't need the capture can make a substantial difference to the speed at which they run, especially if the regex does a lot of backtracking; which is another thing to avoid where possible.

    So using zero-length assertions instead of captures probably has the same benefits. It also makes the purpose of the regex less obscure IMO.
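
    To illustrate the (?:...) point above, here is a minimal sketch. The date string and variable names are made up for the example; it only shows that a non-capturing group saves nothing, and makes no claim about timing.

```perl
use strict;
use warnings;

# Hypothetical input, chosen only for illustration.
my $date = "2004-01-07";

# Capturing: the separator group is saved as a capture even though it is never used.
my @with_capture = $date =~ /^(\d{4})(-|\/)(\d{2})/;

# Non-capturing: (?:...) groups the alternation without saving it,
# so only the year and month come back.
my @without_capture = $date =~ /^(\d{4})(?:-|\/)(\d{2})/;

print scalar(@with_capture), " vs ", scalar(@without_capture), "\n";   # 3 vs 2
```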

    Note: This is just based on my personal experience of using the regex engine, rather than any insight into its workings.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Hooray!

Re: lookahead / lookbehind vs other regex methods
by duff (Parson) on Jan 07, 2004 at 22:07 UTC
    $string =~ s/(\d)-([A-Z])/$1$2/ig;        # my old way
    $string =~ s/(?<=\d)-(?=[A-Z])//ig;       # my new way

    Unless I'm missing something, those lines should both accomplish the exact same thing. I'm really becoming a fan of the lookahead / lookbehind version; it seems like it should run faster than the $1$2 way, but I know virtually nothing of the underlying workings of the regex engine.

    One way to find out which is faster is to use the Benchmark module that comes with the standard perl distribution. However, if execution speed really isn't that critical, it all boils down to being clear to both Perl and the programmer. And if it does what you need, it's more important to be clear to the programmer, as that is who will have to look at it and make sense of it in the future.

    Update: quick benchmarks of my own show the look{ahead,behind} method to be faster and significantly so on large strings. I would imagine this is because of the excessive copying inherent in the $1$2 method.
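
    A benchmark along these lines could look like the sketch below. It is not the benchmark run above: the input string, the helper names strip_capture and strip_lookaround, and the one-second run time are all assumptions made for illustration. Only cmpthese from the core Benchmark module is used.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: many digit-dash-letter pairs in one long string.
my $input = join ' ', ('7-a foo 12-B bar-baz') x 1000;

# Both helpers work on a copy, so every iteration sees the same input.
sub strip_capture {
    my ($s) = @_;
    $s =~ s/(\d)-([A-Z])/$1$2/ig;
    return $s;
}

sub strip_lookaround {
    my ($s) = @_;
    $s =~ s/(?<=\d)-(?=[A-Z])//ig;
    return $s;
}

# A negative count means "run each sub for at least that many CPU seconds".
cmpthese(-1, {
    capture    => sub { strip_capture($input) },
    lookaround => sub { strip_lookaround($input) },
});
```

    The resulting table depends on the perl version and on the data, so it is worth running against strings that resemble the real input.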

Re: lookahead / lookbehind vs other regex methods
by ysth (Canon) on Jan 08, 2004 at 00:17 UTC
    Often there are other considerations. For instance, if you were doing the s/// on dashes preceded and followed by digits, what you expect from "1-2-3" would dictate which to use.
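
    The "1-2-3" case can be checked with a short sketch (variable names are made up for the example). The capturing version consumes the digit after the first dash, so that digit is no longer available to precede the second dash. The lookaround version consumes only the dash itself.

```perl
use strict;
use warnings;

my $captured   = "1-2-3";
my $lookaround = "1-2-3";

# Consumes "1-2"; the next search starts at "-3", which has no leading digit left.
$captured =~ s/(\d)-(\d)/$1$2/g;

# Consumes only the dash, so both dashes match.
$lookaround =~ s/(?<=\d)-(?=\d)//g;

print "$captured\n";     # 12-3
print "$lookaround\n";   # 123
```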

    The speed difference is not due so much to capturing or not but to the fact that substituting a constant string (with /g) loops inside the subst opcode, while interpolating a variable builds a loop of other opcodes, just as if you had s///e:

    $ perl -MO=Concise,-exec -we's/(?<=\d)-(?=[A-Z])//ig'
    1  <0> enter
    2  <;> nextstate(main 1 -e:1) v
    3  <$> const[PV ""] s
    4  </> subst(/"(?<=\\d)-(?=[A-Z])"/) vK
    5  <@> leave[1 ref] vKP/REFC
    -e syntax OK
    $ perl -MO=Concise,-exec -we's/(\d)-([A-Z])/$1$2/ig'
    1  <0> enter
    2  <;> nextstate(main 1 -e:1) v
    3  </> subst(/"(\\d)-([A-Z])"/ replstart->4) v
    4  <#> gvsv[*1] s
    5  <#> gvsv[*2] s
    6  <2> concat[t3] sK/2
    7  <|> substcont(other->3) sK/1
    8  <@> leave[1 ref] vKP/REFC
    -e syntax OK

    But even so, remember that optimization should be one of the last steps in development. Write the code however is most clear to you; then, only if there is a performance problem, do some profiling and optimization. (But if either way is equally clear and equally easy, obviously go with whichever you think may be faster.)
Re: lookahead / lookbehind vs other regex methods
by ambrus (Abbot) on Jan 09, 2004 at 12:05 UTC

    In the case you mentioned, you can avoid zero-length assertions. However, negative look{ahead,behind}s are sometimes more difficult to convert.
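
    As a rough illustration of why the negative forms are harder to convert, here is a sketch that removes every dash not followed by a digit. The input string and the capturing rewrite are assumptions chosen for the example. The rewrite needs an explicit end-of-string alternative and still misses adjacent dashes, because it consumes the character after each dash.

```perl
use strict;
use warnings;

# Hypothetical input: a doubled dash, a dash before a digit, and a trailing dash.
my $text = "a--b 3-4 x-";

# Negative lookahead: drop each dash that is not followed by a digit.
(my $with_lookahead = $text) =~ s/-(?!\d)//g;

# Attempted capture-based equivalent: match the following non-digit
# (or end of string) and put it back.
(my $with_capture = $text) =~ s/-(\D|\z)/$1/g;

print "$with_lookahead\n";   # ab 3-4 x
print "$with_capture\n";     # a-b 3-4 x
```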

    Simple zero-length assertions like \b are present in many other programs that use regexps (sometimes written \<,\>), which shows that they are useful and necessary.
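
    A small sketch of \b as a zero-length assertion (the sample text is made up): it matches a position between a word character and a non-word character without consuming anything, so whole-word matching needs no captures.

```perl
use strict;
use warnings;

# Hypothetical sample text.
my $text = "cat concatenate cat's";

# Count whole-word occurrences; the "cat" inside "concatenate" is skipped.
my $count = () = $text =~ /\bcat\b/g;

# Replace whole words only; \b consumes no characters, so nothing needs restoring.
(my $replaced = $text) =~ s/\bcat\b/dog/g;

print "$count\n";      # 2
print "$replaced\n";   # dog concatenate dog's
```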

Node Type: perlquestion [id://319617]
Approved by IOrdy
Front-paged by broquaint