PerlMonks  

Re: pattern matching with large regex

by Tanktalus (Canon)
on Aug 13, 2005 at 17:50 UTC ( [id://483581]=note )


in reply to pattern matching with large regex

A little more detail on what you're doing with the code would be helpful. For example, are you just testing for existence, or are you extracting pieces of data? Are these regular expressions using metacharacters such as .*[]?, or are they constant strings?

Each of these answers may help us help you optimise your code appropriately. For example, constant strings are generally faster to find with index than with a regular expression. But if you have thousands of them, and you use a regular expression optimiser of some sort from CPAN, you may be able to get a reasonable state machine for finding your data.
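To make the two options concrete, here is a minimal sketch with made-up data (@needles and $text are assumptions, not the original poster's input). It shows a plain index scan over constant strings next to a single alternation built with quotemeta, so that characters such as "." stay literal. Which one is faster on real data is something only a benchmark can answer.

```perl
use strict;
use warnings;

# Hypothetical data: a few constant strings and a text to search.
my @needles = ('foo.example', 'bar.example', 'baz.example');
my $text    = 'visit bar.example today';

# Option 1: index() on each constant string; no regex engine involved.
sub first_hit_index {
    my ($text, @needles) = @_;
    for my $needle (@needles) {
        return $needle if index($text, $needle) >= 0;
    }
    return;
}

# Option 2: one alternation built from the quoted strings.
# quotemeta keeps metacharacters literal; longest strings go first so a
# short string cannot shadow a longer one that starts the same way.
sub build_alternation {
    my @needles = @_;
    my $alt = join '|',
              map  { quotemeta($_) }
              sort { length $b <=> length $a } @needles;
    return qr/($alt)/;
}

my $re = build_alternation(@needles);
print "index: ", first_hit_index($text, @needles), "\n";
print "regex: $1\n" if $text =~ $re;
```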

On the other hand, if you're trying to extract data, which I kind of doubt, and your regular expressions actually use regexp metacharacters, you're probably best off looping through the list:

my @regexps = load_regexps();
@regexps = map { qr/$_/ } @regexps; # pre-compile 'em all.
foreach my $re (@regexps) {
    if ($text =~ $re) {
        # do stuff based on match.
    }
}
Here we precompile each one, and then try each one after another. The compiled regular expressions should execute a bit faster - I'm not sure why, but I'm guessing because the state machine is way simpler. Note that if you only check a single chunk of text, you won't save anything by pre-compiling the regular expressions.
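As a rough sketch of where pre-compiling can pay off, the loop below compiles the list once and then reuses it across several chunks of text. The pattern sources, the chunks, and the helper first_capture are all hypothetical and only there to make the example self-contained.

```perl
use strict;
use warnings;

# Hypothetical pattern sources and input chunks, for illustration only.
my @sources = ('(f?oo)\d+', '(ba[rz])-\w+');
my @chunks  = ('xx foo12 yy', 'nothing here', 'a baz-qux b');

# Compile once, outside the loop over chunks.
my @regexps = map { qr/$_/ } @sources;

# Return the first capture of the first pattern that matches, or undef.
sub first_capture {
    my ($text, @regexps) = @_;
    for my $re (@regexps) {
        return $1 if $text =~ $re;
    }
    return;
}

for my $chunk (@chunks) {
    my $hit = first_capture($chunk, @regexps);
    print defined $hit ? "matched: $hit\n" : "no match\n";
}
```

With a single chunk the compile step runs once either way, which is why nothing is saved in that case.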

Replies are listed 'Best First'.
Re^2: pattern matching with large regex
by Anonymous Monk on Aug 13, 2005 at 21:32 UTC
    Most of the regex strings are constant; a few hundred may contain simple constructs like alternation and character classes: (f?oo|bar|baz|etc)[\w\-]*\.[0-9]{3,} We only extract the data if it matches. As many have suggested, I benchmarked a typical case with the actual data, and unless something is wrong, the difference is extreme:
    my %cases = (
        'one_large'  => sub { if($text=~/(stuff?)m0r3(?:[^:]*\.)?($big_string)/i){my $match="$1:$2"}},
        'many_small' => sub { for(@strings){ if($text=~/(stuff?)m0r3(?:[^:]*\.)?($_)/i){my $match="$1:$2"}}},
    );
    print '$text = ', length $text, " characters\n",
          '$big_string = ', length $big_string, " characters\n",
          '@strings = ', scalar @strings, " items\n\n";
    cmpthese( 0, \%cases);
    Results:
    $text = 4578 characters
    $big_string = 210724 characters
    @strings = 10634 items

                 Rate many_small  one_large
    many_small 1.05/s         --      -100%
    one_large   630/s     60089%         --
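For anyone who wants to try the comparison without the real data, here is a self-contained sketch along the same lines. The contents of @strings and $text are invented stand-ins, and the two cases are written as named subs that return the match so they can be checked against each other; the numbers it prints say nothing about the figures quoted above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical stand-in data; the real $text and @strings were not posted.
my @strings    = map { "host$_\\.example" } 1 .. 200;
my $big_string = join '|', @strings;
my $text       = ('filler text ' x 50) . 'stuffm0r3sub.host150.example';

# One match attempt against a single large alternation.
sub one_large {
    return $text =~ /(stuff?)m0r3(?:[^:]*\.)?($big_string)/i ? "$1:$2" : undef;
}

# One match attempt per string, keeping the last hit as in the posted loop.
sub many_small {
    my $match;
    for (@strings) {
        $match = "$1:$2" if $text =~ /(stuff?)m0r3(?:[^:]*\.)?($_)/i;
    }
    return $match;
}

# -1 asks Benchmark to run each case for at least one CPU second.
cmpthese( -1, { one_large => \&one_large, many_small => \&many_small } );
```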

      Not having any of the data that you're working with, all I can do is offer suggestions that may or may not help. I can't test them myself first, which would let me keep my mouth shut about the ones that don't work. ;-)

      So, I'm just curious what happens when you a) use a regexp optimiser from CPAN to "optimise" $big_string (of course, proving that the optimisation didn't break anything would be a bit painful), and b) pre-compile your @strings - e.g.:

      use Benchmark qw(cmpthese);
      use Regexp::Optimizer;

      print '$text = ', length $text, " characters\n",
            '$big_string = ', length $big_string, " characters\n",
            '@strings = ', scalar @strings, " items\n\n";

      # Optimise the one big pattern, and pre-compile each small one.
      my $big_regexp    = Regexp::Optimizer->new()->optimize($big_string);
      my @small_regexps = map { qr/$_/i } @strings;

      my %cases = (
          'one_large'  => sub { if($text=~/(stuff?)m0r3(?:[^:]*\.)?($big_regexp)/i){my $match="$1:$2"}},
          'many_small' => sub { for(@small_regexps){ if($text=~/(stuff?)m0r3(?:[^:]*\.)?($_)/i){my $match="$1:$2"}}},
      );
      cmpthese( 0, \%cases);
        Pre-compiling @strings had no effect. Inherent laziness prevents me from optimising $big_string since it's plenty fast.
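One possible reason pre-compiling @strings made no difference: when a qr// object is interpolated into a larger pattern, Perl still has to compile the enclosing pattern, and it does so again each time the interpolated value changes. A sketch of the alternative is to compile the complete pattern once per string and match against that directly. The data and the helper find_match are hypothetical, and this has not been measured against the real input.

```perl
use strict;
use warnings;

# Hypothetical stand-in data; the thread's real @strings were not posted.
my @strings = ('alpha\.example', 'beta\.example', 'gamma\.example');
my $text    = 'prefix stuffm0r3www.beta.example suffix';

# Compile the complete pattern once per string, not just the varying tail.
my @full_regexps = map { qr/(stuff?)m0r3(?:[^:]*\.)?($_)/i } @strings;

# Return "prefix:string" for the first pattern that matches, or undef.
sub find_match {
    my ($text) = @_;
    for my $re (@full_regexps) {
        return "$1:$2" if $text =~ $re;   # used as-is, nothing interpolated here
    }
    return;
}

my $match = find_match($text);
print defined $match ? "match: $match\n" : "no match\n";
```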
      In your 'one_large' example you get the first match. In 'many_small' you get the last one. Try adding a last when you get a match in the for loop and see what happens.
        Nice catch, but last won't help here because a match will be the exception. Most of the time we check it all and fail to match, but in production last definitely belongs there.
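The first-match versus last-match difference can be seen in a small sketch like the one below. The strings, the sample text, and the helper scan are all invented for illustration; the sample is built so that two of the three strings match.

```perl
use strict;
use warnings;

# Hypothetical patterns; two of them match the sample text.
my @strings = ('foo', 'bar', 'baz');
my $text    = 'stuffm0r3x.foo stuffm0r3y.baz';

# Returns (match, number of patterns tried).
sub scan {
    my ($text, $stop_early) = @_;
    my ($match, $tried) = (undef, 0);
    for my $s (@strings) {
        $tried++;
        if ($text =~ /(stuff?)m0r3(?:[^:]*\.)?($s)/i) {
            $match = "$1:$2";
            last if $stop_early;   # keep the first hit and skip the rest
        }
    }
    return ($match, $tried);
}

printf "without last: %s after %d tries\n", scan($text, 0);
printf "with last:    %s after %d tries\n", scan($text, 1);
```

When nothing matches, both variants try every string, which is the case described above where last makes no difference to the timing.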
