Re: pattern matching with large regex

A little more detail on what you're doing with the code would be helpful. For example, are you just testing for existence, or are you extracting pieces of data? Do these regular expressions use metacharacters such as .*[]?, or are they constant strings?

The answers to these questions may help us help you optimise your code appropriately. For example, searching for constant strings is generally faster with index than with regular expressions. But if you have thousands of them, and you use a regular expression optimizer of some sort from CPAN, you may be able to get a reasonable state machine for finding your data.
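To make the index suggestion concrete, here is a minimal sketch of an existence test over constant strings. The needles and the text are made up for illustration; they are not from the original question.

```perl
use strict;
use warnings;

# Hypothetical data: needles and text are invented for this sketch.
my @needles = ('foo', 'bar.baz', 'a+b');
my $text    = 'some text containing a+b and more';

# index() returns the position of the first occurrence, or -1 if absent.
# It treats the needle literally, so '.' and '+' need no escaping.
my @found = grep { index($text, $_) >= 0 } @needles;

print "found: @found\n";
```

This only answers "is it there?". If you need captures or metacharacters, index is not enough and you are back to regular expressions.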

On the other hand if you're trying to extract data, which I kind of doubt, and your regular expressions actually use regexp metacharacters, you're probably best off looping through the list:

my @regexps = load_regexps();
@regexps = map { qr/$_/ } @regexps; # pre-compile 'em all.
foreach my $re (@regexps)
{
  if ($text =~ $re) {
    # do stuff based on match.
  }
}

Here we precompile each one with qr// and then try them one after another. The compiled regular expressions should execute a bit faster. I'm not sure why, but I'm guessing it's because each state machine is far simpler. Note that if you only check a single chunk of text, you won't save anything by pre-compiling the regular expressions.

Re^2: pattern matching with large regex by Anonymous Monk on Aug 13, 2005 at 21:32 UTC
Most of the regex strings are constant; a few hundred may contain simple constructs like alternation and character classes: `(f?oo|bar|baz|etc)[\w\-]\.[0-9]{3,}` We only extract the data if it matches. As many have suggested, I benchmarked a typical case with the actual data, and unless something is wrong the difference is extreme:

use Benchmark qw(cmpthese);

my %cases = (
  'one_large'  => sub { if ($text =~ /(stuff?)m0r3(?:[^:]\.)?($big_string)/i) { my $match = "$1:$2" } },
  'many_small' => sub { for (@strings) { if ($text =~ /(stuff?)m0r3(?:[^:]*\.)?($_)/i) { my $match = "$1:$2" } } },
);
print '$text = ', length $text, " characters\n",
      '$big_string = ', length $big_string, " characters\n",
      '@strings = ', scalar @strings, " items\n\n";
cmpthese(0, \%cases);

Results:

$text = 4578 characters
$big_string = 210724 characters
@strings = 10634 items

             Rate many_small  one_large
many_small 1.05/s         --      -100%
one_large   630/s     60089%         --
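The post doesn't show how $big_string was put together. One plausible way, and this is only an assumption, is to join the strings with | after escaping the constant ones with quotemeta. The strings and text below are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical constant strings; the real list is not shown in the thread.
my @strings = ('foo.bar', 'baz', 'qu+x');

# quotemeta escapes regex metacharacters so each string matches literally.
# Sorting longest-first makes the alternation try longer candidates before
# shorter ones that might be their prefixes.
my $big_string = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) } @strings;

my $text = 'stuffm0r3 x.qu+x';

my $match;
if ($text =~ /(stuff?)m0r3(?:[^:]*\.)?($big_string)/i) {
    $match = "$1:$2";
    print "$match\n";
}
```

Strings that are meant to be real patterns (the few hundred with alternation and character classes) would have to skip the quotemeta step, so in practice you'd keep the two kinds in separate lists.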
Re^3: pattern matching with large regex by Tanktalus (Canon) on Aug 13, 2005 at 23:23 UTC
Not having any of the data that you're working with, all I can do is offer suggestions that may or may not help. I can't test them first, so I can't keep my mouth shut about the ones that don't work. ;-) So, I'm just curious what happens when you a) use a regexp optimiser from CPAN to "optimise" $big_string (of course, proving that the optimisation didn't break anything would be a bit painful), and b) pre-compile your @strings, e.g.:

use Benchmark qw(cmpthese);
use Regexp::Optimizer;

print '$text = ', length $text, " characters\n",
      '$big_string = ', length $big_string, " characters\n",
      '@strings = ', scalar @strings, " items\n\n";

my $big_regexp = Regexp::Optimizer->new()->optimize($big_string);
my @small_regexps = map { qr/$_/i } @strings; # pre-compile 'em all.

my %cases = (
  'one_large'  => sub { if ($text =~ /(stuff?)m0r3(?:[^:]\.)?($big_regexp)/i) { my $match = "$1:$2" } },
  'many_small' => sub { for (@small_regexps) { if ($text =~ /(stuff?)m0r3(?:[^:]\.)?($_)/i) { my $match = "$1:$2" } } },
);
cmpthese(0, \%cases);
Re^4: pattern matching with large regex by Anonymous Monk on Aug 14, 2005 at 07:43 UTC
Pre-compiling @strings had no effect. Inherent laziness prevents me from optimising $big_string since it's plenty fast.
Re^3: pattern matching with large regex by lidden (Curate) on Aug 13, 2005 at 21:46 UTC
In your 'one_large' example you get the first match. In 'many_small' you get the last one. Try adding a `last` when you get a match in the for loop and see what happens.
Re^4: pattern matching with large regex by Anonymous Monk on Aug 13, 2005 at 22:10 UTC
Nice catch, but `last` won't help here because a match will be the exception. Most of the time we check it all and fail to match, but in production `last` definitely belongs there.
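For reference, a minimal sketch of the loop with `last` in place. The patterns and text are made up so that two of the three patterns can match; with the real data a match is the exception, as noted above.

```perl
use strict;
use warnings;

# Hypothetical patterns and text, invented for this sketch.
my @small_regexps = map { qr/$_/i } ('alpha', 'beta', 'gamma');
my $text          = 'stuffm0r3 x.beta.gamma';

my ($match, $tried) = (undef, 0);
for my $re (@small_regexps) {
    $tried++;
    if ($text =~ /(stuff?)m0r3(?:[^:]*\.)?($re)/i) {
        $match = "$1:$2";
        last;    # stop at the first pattern that matches
    }
}

print defined $match ? "$match after $tried tries\n" : "no match\n";
```

Without the `last`, the loop would keep going and $match would end up holding the result for 'gamma', the last pattern that matches, which is the difference lidden pointed out.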

