I decided to take a stab at optimizing the regular expression itself. Here's what I did. First, I backed out of your regular expression all the possible strings that would match exactly. Then I used those as the basis for new patterns that had no alternation at all. I sorted each pattern to get similarities grouped together. Then I categorized them by common keyword, and further subcategorized them by whether that keyword had additional text after it, or before it (or both). That provided the basis for new alternation logic.
From here on out I did everything programatically: Sorted all of the alternations in order from greatest length to shortest length. Then sorted each of the groupings in order from longest to shortest, again forming another layer of alternations.
The goal was to bite off the biggest possible chunks first, smaller later. And to never have more than two layers of alternation in any given potential match.
Here's the resulting code (and I apologize right now for the fact that I haven't taken advantage of the /x modifier; the regexes were constructed programatically):
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw( cmpthese );
my @patterns = (
'(?:Internal\s+Revenue\s+Service|invaluable\s+practical|revenu
+e\s+recognition|Treasury\s+Department|ETIF\s+accounting|implementatio
+n|administrative|authoritative|interpretive|superceding|regulatory|ap
+plicable|accounting|definitive|government|any\s+ruling|transition|pro
+cedural|receiving|technical|under\s+the|staff\'s|judicial|Treasury|va
+luable|specific|college|related|interim|valued|absent|their|his|SEC|F
+DA|IRS)\s+guidance',
'provided\s+by\s+(?:the\s+Securities\s+and\s+Exchange\s+Commis
+sion|Securities\s+and\s+Exchange\s+Commission|the\s+Financial\s+Accou
+nting|Financial\s+Accounting|the\s+United\s+States|the\s+Secretary|Un
+ited\s+States|Secretary|the\s+SEC|SEC)',
'issued\s+by\s+(?:the\s+Securities\s+and\s+Exchange\s+Commissi
+on|Securities\s+and\s+Exchange\s+Commission|the\s+Financial\s+Account
+ing|Financial\s+Accounting|the\s+United\s+States|the\s+Secretary|Unit
+ed\s+States|Secretary|the\s+SEC|SEC)',
'assumes\s+guidance\s+of\s+(?:the\s+talented\s+team|the\s+comp
+ensation|a\s+talented\s+team|a\s+compensation|the\s+company|the\s+boa
+rd|a\s+company|a\s+board)',
'guidance\s+(?:promulgated\s+thereunder|promulgated|to\s+appro
+ve|technology|and\s+rules|facility|software|in\s+SFAS|system)',
'(?:may\s+contain\s+statements\s+about\s+future\s+events,|if\s
++on\s+negative|suggesting\s+an|Microsoft|rating)\s+outlook',
'according\s+to\s+the\s+guidance\s+contained',
'other\s+guidance\s+(?:issued|under|from)',
'provide\s+guidance\s+to\s+directors',
'current\s+guidance\s+(?:under|from)',
'applicable\s+guidance\s+issued',
'outlook\s+for\s+any\s+rating',
);
my $alternations = '\b' . join( '\b|\b', @patterns ) . '\b';
$main::regex_opt = qr/$alternations/;
$main::regex_old = qr/outlook\s+for\s+any\s+rating|(?:rating|if\s+on\s
++negative|Microsoft|suggesting\s+an|may\s+contain\s+statements\s+abou
+t\s+future\s+events\,|business\s+conditions\s+and\s+the)\s+outlook|gu
+idance\s+(?:to\s+approve|facility)
|(?:authoritative|revenue\s+recognition|invaluable\s
++practical|valuable|regulatory|technical|under\s+the|staff\'s|judicia
+l|SEC|FDA|Treasury(?:\s+Department)?|specific|implementation|their|go
+vernment|any\s+ruling|college|absent|\s+his|interim|interpretive|tran
+sition|administrative|procedural|related|applicable|accounting|defini
+tive|superceding|IRS|Internal\s+Revenue\s+Service|valued|EITF\s+accou
+nting)\s+guidance
|guidance\s+(?:and\s+rules|promulgated(?:\s+thereund
+er)?|in\s+SFAS)|(?:provided|issued)\s+by\s+(?:the\s+)?(?:SEC|Securiti
+es\s+and\s+Exchange\s+Commission|Internal\s+Revenue\s+Service|Secreta
+ry|United\s+States|Financial\s+Accounting)
|(?:other|applicable)\s+guidance\s+issued|according\
+s+to\s+the\s+guidance\s+contained|provide\s+guidance\s+to\s+directors
+|receiving\s+guidance
|(?:current|other)\s+guidance\s+(?:under|from)|assum
+es\s+guidance\s+of\s+(?:the|a)\s+(?:company|board|talented\s+team|com
+pensation)|guidance\s+(?:system|software|technology)
/xi;
$main::string = do { local $/ = undef; <DATA>; };
sub reopt {
my $counter = () = $main::string =~ m/$main::regex_opt/g;
return $counter;
}
sub reold {
my $counter = () = $main::string =~ m/$main::regex_old/g;
return $counter;
}
print reopt(), "\n";
print reold(), "\n";
cmpthese ( -10, {
reopt => \&reopt,
reold => \&reold,
} );
__DATA__
outlook for any rating
rating outlook
if on negative outlook
Microsoft outlook
suggesting an outlook
may contain statements about future events, outlook
guidance to approve
guidance facility
authoritative guidance
revenue recognition guidance
invaluable practical guidance
valuable guidance
regulatory guidance
technical guidance
under the guidance
staff's guidance
judicial guidance
SEC guidance
FDA guidance
Treasury guidance
Treasury Department guidance
specific guidance
implementation guidance
their guidance
government guidance
any ruling guidance
college guidance
absent guidance
his guidance
interim guidance
interpretive guidance
transition guidance
administrative guidance
procedural guidance
related guidance
applicable guidance
accounting guidance
definitive guidance
superceding guidance
IRS guidance
Internal Revenue Service guidance
valued guidance
ETIF accounting guidance
guidance and rules
guidance promulgated
guidance promulgated thereunder
guidance in SFAS
provided by the SEC
provided by the Securities and Exchange Commission
provided by the Secretary
provided by the United States
provided by the Financial Accounting
provided by SEC
provided by Securities and Exchange Commission
provided by Secretary
provided by United States
provided by Financial Accounting
issued by the SEC
issued by the Securities and Exchange Commission
issued by the Secretary
issued by the United States
issued by the Financial Accounting
issued by SEC
issued by Securities and Exchange Commission
issued by Secretary
issued by United States
issued by Financial Accounting
other guidance issued
applicable guidance issued
according to the guidance contained
provide guidance to directors
receiving guidance
current guidance under
other guidance under
current guidance from
other guidance from
assumes guidance of the company
assumes guidance of the board
assumes guidance of the talented team
assumes guidance of the compensation
assumes guidance of a company
assumes guidance of a board
assumes guidance of a talented team
assumes guidance of a compensation
guidance system
guidance software
guidance technology
And here's the benchmark output:
87
87
Rate reold reopt
reold 1318/s -- -75%
reopt 5189/s 294% --
...and on a more beefy box...
87
87
Rate reold reopt
reold 2585/s -- -74%
reopt 10114/s 291% --
Both were run with Perl v5.16.1.
It would be a much better test if I had your actual input data. Instead, I have all the possible ${^MATCH} terms (in the __DATA__ segment). I'm only verifying that all the possible matches are matching. You'll have to verify that rejections are occurring as they should, by testing against some of your real data.
As you can see, we get almost a 3x performance improvement by optimizing the alternations. But the data set I'm testing with is highly contrived. The only way to really know how much this improves upon your existing solution is to benchmark it with real data. And I still believe there is value in knowing what other bottlenecks you might have, which can be done by profiling your actual code with a real dataset.
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.