Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

I decided to take a stab at optimizing the regular expression itself. Here's what I did. First, I backed out of your regular expression all the possible strings that would match exactly. Then I used those as the basis for new patterns that had no alternation at all. I sorted each pattern to get similarities grouped together. Then I categorized them by common keyword, and further subcategorized them by whether that keyword had additional text after it, or before it (or both). That provided the basis for new alternation logic.

From here on out I did everything programatically: Sorted all of the alternations in order from greatest length to shortest length. Then sorted each of the groupings in order from longest to shortest, again forming another layer of alternations.

The goal was to bite off the biggest possible chunks first, smaller later. And to never have more than two layers of alternation in any given potential match.

Here's the resulting code (and I apologize right now for the fact that I haven't taken advantage of the /x modifier; the regexes were constructed programatically):

#!/usr/bin/env perl use strict; use warnings; use Benchmark qw( cmpthese ); my @patterns = ( '(?:Internal\s+Revenue\s+Service|invaluable\s+practical|revenu +e\s+recognition|Treasury\s+Department|ETIF\s+accounting|implementatio +n|administrative|authoritative|interpretive|superceding|regulatory|ap +plicable|accounting|definitive|government|any\s+ruling|transition|pro +cedural|receiving|technical|under\s+the|staff\'s|judicial|Treasury|va +luable|specific|college|related|interim|valued|absent|their|his|SEC|F +DA|IRS)\s+guidance', 'provided\s+by\s+(?:the\s+Securities\s+and\s+Exchange\s+Commis +sion|Securities\s+and\s+Exchange\s+Commission|the\s+Financial\s+Accou +nting|Financial\s+Accounting|the\s+United\s+States|the\s+Secretary|Un +ited\s+States|Secretary|the\s+SEC|SEC)', 'issued\s+by\s+(?:the\s+Securities\s+and\s+Exchange\s+Commissi +on|Securities\s+and\s+Exchange\s+Commission|the\s+Financial\s+Account +ing|Financial\s+Accounting|the\s+United\s+States|the\s+Secretary|Unit +ed\s+States|Secretary|the\s+SEC|SEC)', 'assumes\s+guidance\s+of\s+(?:the\s+talented\s+team|the\s+comp +ensation|a\s+talented\s+team|a\s+compensation|the\s+company|the\s+boa +rd|a\s+company|a\s+board)', 'guidance\s+(?:promulgated\s+thereunder|promulgated|to\s+appro +ve|technology|and\s+rules|facility|software|in\s+SFAS|system)', '(?:may\s+contain\s+statements\s+about\s+future\s+events,|if\s ++on\s+negative|suggesting\s+an|Microsoft|rating)\s+outlook', 'according\s+to\s+the\s+guidance\s+contained', 'other\s+guidance\s+(?:issued|under|from)', 'provide\s+guidance\s+to\s+directors', 'current\s+guidance\s+(?:under|from)', 'applicable\s+guidance\s+issued', 'outlook\s+for\s+any\s+rating', ); my $alternations = '\b' . join( '\b|\b', @patterns ) . '\b'; $main::regex_opt = qr/$alternations/; $main::regex_old = qr/outlook\s+for\s+any\s+rating|(?:rating|if\s+on\s ++negative|Microsoft|suggesting\s+an|may\s+contain\s+statements\s+abou +t\s+future\s+events\,|business\s+conditions\s+and\s+the)\s+outlook|gu +idance\s+(?:to\s+approve|facility) |(?:authoritative|revenue\s+recognition|invaluable\s ++practical|valuable|regulatory|technical|under\s+the|staff\'s|judicia +l|SEC|FDA|Treasury(?:\s+Department)?|specific|implementation|their|go +vernment|any\s+ruling|college|absent|\s+his|interim|interpretive|tran +sition|administrative|procedural|related|applicable|accounting|defini +tive|superceding|IRS|Internal\s+Revenue\s+Service|valued|EITF\s+accou +nting)\s+guidance |guidance\s+(?:and\s+rules|promulgated(?:\s+thereund +er)?|in\s+SFAS)|(?:provided|issued)\s+by\s+(?:the\s+)?(?:SEC|Securiti +es\s+and\s+Exchange\s+Commission|Internal\s+Revenue\s+Service|Secreta +ry|United\s+States|Financial\s+Accounting) |(?:other|applicable)\s+guidance\s+issued|according\ +s+to\s+the\s+guidance\s+contained|provide\s+guidance\s+to\s+directors +|receiving\s+guidance |(?:current|other)\s+guidance\s+(?:under|from)|assum +es\s+guidance\s+of\s+(?:the|a)\s+(?:company|board|talented\s+team|com +pensation)|guidance\s+(?:system|software|technology) /xi; $main::string = do { local $/ = undef; <DATA>; }; sub reopt { my $counter = () = $main::string =~ m/$main::regex_opt/g; return $counter; } sub reold { my $counter = () = $main::string =~ m/$main::regex_old/g; return $counter; } print reopt(), "\n"; print reold(), "\n"; cmpthese ( -10, { reopt => \&reopt, reold => \&reold, } ); __DATA__ outlook for any rating rating outlook if on negative outlook Microsoft outlook suggesting an outlook may contain statements about future events, outlook guidance to approve guidance facility authoritative guidance revenue recognition guidance invaluable practical guidance valuable guidance regulatory guidance technical guidance under the guidance staff's guidance judicial guidance SEC guidance FDA guidance Treasury guidance Treasury Department guidance specific guidance implementation guidance their guidance government guidance any ruling guidance college guidance absent guidance his guidance interim guidance interpretive guidance transition guidance administrative guidance procedural guidance related guidance applicable guidance accounting guidance definitive guidance superceding guidance IRS guidance Internal Revenue Service guidance valued guidance ETIF accounting guidance guidance and rules guidance promulgated guidance promulgated thereunder guidance in SFAS provided by the SEC provided by the Securities and Exchange Commission provided by the Secretary provided by the United States provided by the Financial Accounting provided by SEC provided by Securities and Exchange Commission provided by Secretary provided by United States provided by Financial Accounting issued by the SEC issued by the Securities and Exchange Commission issued by the Secretary issued by the United States issued by the Financial Accounting issued by SEC issued by Securities and Exchange Commission issued by Secretary issued by United States issued by Financial Accounting other guidance issued applicable guidance issued according to the guidance contained provide guidance to directors receiving guidance current guidance under other guidance under current guidance from other guidance from assumes guidance of the company assumes guidance of the board assumes guidance of the talented team assumes guidance of the compensation assumes guidance of a company assumes guidance of a board assumes guidance of a talented team assumes guidance of a compensation guidance system guidance software guidance technology

And here's the benchmark output:

87 87 Rate reold reopt reold 1318/s -- -75% reopt 5189/s 294% --

...and on a more beefy box...

87 87 Rate reold reopt reold 2585/s -- -74% reopt 10114/s 291% --

Both were run with Perl v5.16.1.

It would be a much better test if I had your actual input data. Instead, I have all the possible ${^MATCH} terms (in the __DATA__ segment). I'm only verifying that all the possible matches are matching. You'll have to verify that rejections are occurring as they should, by testing against some of your real data.

As you can see, we get almost a 3x performance improvement by optimizing the alternations. But the data set I'm testing with is highly contrived. The only way to really know how much this improves upon your existing solution is to benchmark it with real data. And I still believe there is value in knowing what other bottlenecks you might have, which can be done by profiling your actual code with a real dataset.


Dave


In reply to Re: Help with speeding up regex by davido
in thread Help with speeding up regex by eversuhoshin

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others meditating upon the Monastery: (3)
As of 2024-03-29 02:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found