Re: pattern matching with large regex

by CountZero (Bishop)
on Aug 13, 2005 at 17:48 UTC (id://483580)


in reply to pattern matching with large regex

The only way to tell is to time your approach against the alternatives. The most obvious alternative would, of course, be to run the match in a loop, checking each alternative separately.

My feeling however is that this will be much slower.
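A minimal sketch of such a timing run, using the core Benchmark module. The word list and the input lines are made up for illustration; the real alternatives and a representative sample of the real data should take their place, since the outcome depends heavily on both.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical word list and input; substitute your real alternatives and data.
my @words = qw(apple banana cherry grape melon orange peach plum);
my @lines = map { "line $_ has a " . $words[$_ % @words] . " in it" } 1 .. 200;
push @lines, map { "line $_ has nothing of interest" } 1 .. 200;

# Approach 1: one big alternation, compiled once.
my $big_re = do {
    my $alt = join '|', map { quotemeta($_) } @words;
    qr/(?:$alt)/;
};

# Approach 2: one precompiled regex per alternative, tried in a loop.
my @small_res = map { qr/\Q$_\E/ } @words;

sub count_single {
    my $hits = 0;
    for my $line (@lines) {
        $hits++ if $line =~ $big_re;
    }
    return $hits;
}

sub count_looped {
    my $hits = 0;
    LINE: for my $line (@lines) {
        for my $re (@small_res) {
            if ($line =~ $re) { $hits++; next LINE; }
        }
    }
    return $hits;
}

# Both approaches must agree before their timings mean anything.
die "approaches disagree" unless count_single() == count_looped();

# Run each for at least one CPU second and print a comparison table.
cmpthese(-1, {
    single_alternation => \&count_single,
    looped_matches     => \&count_looped,
});
```

The table printed by cmpthese shows iterations per second and the relative difference between the two subs. No result is quoted here on purpose: the numbers are only meaningful for your own data and Perl version.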

Another alternative would be to hand-craft the regex, factoring out the chunks that are the same or similar across the alternatives. Writing such a regex will take much more time, and I'm not sure whether a more elaborate (but shorter) regex will run faster than a simple bunch of alternatives.
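To illustrate what factoring out shared chunks could look like, here is a small sketch. The three alternatives are hypothetical, and anchoring both patterns with ^ and $ is an assumption made only to keep the comparison strict.

```perl
use strict;
use warnings;

# Hypothetical alternatives, chosen only to show shared chunks.
my @alternatives = qw(foobar fooxar foozap);

# The simple bunch of alternatives.
my $plain = qr/^(?:foobar|fooxar|foozap)$/;

# Hand-factored: the common prefix "foo" is pulled out, and the two
# alternatives ending in "ar" are merged into a character class.
my $factored = qr/^foo(?:[bx]ar|zap)$/;

# Returns true when both regexes give the same verdict for a string.
sub same_verdict {
    my ($string) = @_;
    my $plain_hit    = $string =~ $plain    ? 1 : 0;
    my $factored_hit = $string =~ $factored ? 1 : 0;
    return $plain_hit == $factored_hit;
}

# Check the alternatives themselves plus a few near misses.
for my $string (@alternatives, qw(foo fooyar foozar barfoo)) {
    printf "%-8s %s\n", $string, same_verdict($string) ? 'agree' : 'DISAGREE';
}
```

A check like this only shows that the two patterns accept the same strings; whether the factored form is faster still has to be measured. Note that Perl 5.10 and later apply a trie optimisation to alternations of literal strings on their own, so the gain from factoring by hand may be small there.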

CountZero

"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Replies are listed 'Best First'.
Re^2: pattern matching with large regex
by Tanktalus (Canon) on Aug 13, 2005 at 17:59 UTC

    Looping will be slower? According to simonm, matching multiple times is faster than a single super-match. Only a benchmark on actual data will show for sure, but I've seen it said on PM many times already that multiple smaller matches are faster than a single larger match. I presume that simpler FSMs, combined with less room for backtracking, have something to do with it. I'm not entirely sure, though.

      Indeed only benchmarking will do, as a lot depends on the data-set (are the regexes anchored or not; is there room for a lot of back-tracking, ...).

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re^2: pattern matching with large regex
by fishbot_v2 (Chaplain) on Aug 13, 2005 at 17:55 UTC

    Or instead of handcrafting it, use something like Regex::PreSuf, which groups substrings by longest prefixes (among other things).

    If you are using a static set of alternations, I would recommend caching the generated regex somehow, though, since building it is rather expensive time-wise. Additionally, it isn't guaranteed to be faster, so benchmark with reasonable data. (See Re^4: removing stop words for an example with a benchmark.)
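One way to read "caching this somehow" is to build the pattern once per word list and keep the compiled regex in memory. The sketch below does that. The helper names build_pattern and regex_for and the %regex_cache hash are made up for this example. It uses Regex::PreSuf's presuf function when the module is installed and falls back to a plain alternation otherwise; the word list is assumed to contain only plain words.

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

my %regex_cache;    # sorted word list (joined) => compiled regex

# Build the pattern text. Regex::PreSuf is optional here: if it is not
# installed, fall back to a plain alternation of the quoted words.
sub build_pattern {
    my @words = @_;
    if (eval { require Regex::PreSuf; 1 }) {
        return Regex::PreSuf::presuf(@words);
    }
    return join '|', map { quotemeta($_) } @words;
}

# Return a compiled regex for a word list, building it only the first
# time that particular set of words is seen.
sub regex_for {
    my @words = @_;
    my $key = join "\0", sort @words;
    unless (exists $regex_cache{$key}) {
        my $pattern = build_pattern(@words);
        $regex_cache{$key} = qr/(?:$pattern)/;
    }
    return $regex_cache{$key};
}

my $re = regex_for(qw(foobar fooxar foozap));
for my $text ('a foozap here', 'only foozar here') {
    print "$text: ", ($text =~ $re ? 'match' : 'no match'), "\n";
}
```

For a set that is static across runs, the pattern string returned by build_pattern could also be written to a file and read back, so that the expensive step is skipped entirely on later runs.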
