Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

Help with speeding up regex

by eversuhoshin (Sexton)
on Aug 10, 2012 at 22:16 UTC ( #986831=perlquestion: print w/ replies, xml ) Need Help??
eversuhoshin has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I am reading in financial filings to see whether the information I need is in the financial filing. I used regex to get rid of those documents that I do not need to examine. My matching regex uses tons of "or" and counts up false positive words/ phrases. However, using lot of "or" slows perl down so much. I was wondering if there is a way to speed up the matching process. Thank you so much!

here is the code. I slurp the filing and then make into a one long line and then do the matching

$fcount=()=$data=~m/outlook\s+for\s+any\s+rating|(?:rating|if\s+on\ +s+negative|Microsoft|suggesting\s+an|may\s+contain\s+statements\s+abo +ut\s+future\s+events\,|business\s+conditions\s+and\s+the)\s+outlook|g +uidance\s+(?:to\s+approve|facility) |(?:authoritative|revenue\s+recognition|invaluable\s ++practical|valuable|regulatory|technical|under\s+the|staff\'s|judicia +l|SEC|FDA|Treasury(?:\s+Department)?|specific|implementation|their|go +vernment|any\s+ruling|college|absent|\s+his|interim|intrepretive|tran +sition|administrative|procedural|related|applicable|accounting|defini +tive|superceding|IRS|Internal\s+Revenue\s+Service|valued|EITF\s+accou +nting)\s+guidance |guidance\s+(?:and\s+rules|promulgated(?:\s+thereund +er)?|in\s+SFAS)|(?:provided|issued)\s+by\s+(?:the\s+)?(?:SEC|Securiti +es\s+and\s+Exchange\s+Commission|Internal\s+Revenue\s+Service|Secreta +ry|United\s+States|Financial\s+Accounting) |(?:other|applicable)\s+guidance\s+issued|according\ +s+to\s+the\s+guidance\s+contained|provide\s+guidance\s+to\s+directors +|receiving\s+guidance |(?:current|other)\s+guidance\s+(?:under|from)|assum +es\s+guidance\s+of\s+(?:the|a)\s+(?:company|board|talented\s+team|com +pensation)|guidance\s+(?:system|software|technology) /xig;

Comment on Help with speeding up regex
Download Code
Re: Help with speeding up regex
by CountZero (Bishop) on Aug 10, 2012 at 22:44 UTC
    Check out Regexp::Assemble. It assembles your multiple "or" conditions into a more efficient regex.

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Help with speeding up regex
by Kenosis (Priest) on Aug 10, 2012 at 23:19 UTC

    Could you provide some sample text contained in $data? This may help the regex optimization.

Re: Help with speeding up regex
by hbm (Hermit) on Aug 10, 2012 at 23:40 UTC

    I'd look to process multiple filings in parallel, if you aren't already.

    This certainly won't help much, but can these ORs:

    (?:other|applicable)\s+guidance\s+issued (?:current|other)\s+guidance\s+(?:under|from) guidance\s+(?:and\s+rules|promulgated(?:\s+thereunder)?|in\s+SFAS)

    Be combined?

    (?:current|other|applicable)?\s*guidance\s+(?:under|from|issued|and +\s+rules|promulgated(?:\s+thereunder)?|in\s+SFAS)
Re: Help with speeding up regex
by davido (Archbishop) on Aug 10, 2012 at 23:53 UTC

    Have you already profiled with Devel::NYTProf so that you're certain the pattern matching is where you're spending too much time? This is a really worthwhile thing to do; it would be unfortunate spending time focusing on fixing the regular expression only to find you are IO bound. Your hunch may be correct, but it's best to know for sure before diving into an optimization effort.

    If you need to install Devel::NYTProf, install version 4.06 (not 4.07) from CPAN, since v4.07 has a minor bug in the nytprofhtml utility that would prevent you from using that utility to see the results in your browser.

    (There's a simple fix, patch submitted, and a future version will certainly have made the repair.)

    Update:Tim Bunce has released v4.08 now, which patches the problem from v4.07. So assuming it's found its way to your local CPAN mirror, you should be able to install Devel::NYTProf with the simple "cpanm Devel::NYTProf" command, and a few minutes of patience.


    Dave

      How does NYTProf speed up regex?


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

        He said "so that you're certain the pattern matching is where you're spending too much time? " -- is regex pattern bottleneck, yes or no? Its a good question

      Hellp Davido, Thank you for your kind reply. Sorry but I am using active perl using Komodo, can you help me with utilizing Devel::NYTProf? What do I have to do to assess my script? I am still learning and I didn't fully understand the cspan explanation. Thank you again

        ActiveState has Devel::NYTProf in its ppm4 repository here: http://code.activestate.com/ppm/Devel-NYTProf/. Once installed, you would cd into the target script's directory and execute a one-liner that invokes your script: perl -d:NYTProf some_perl.pl input_file.txt. And after it completes, you can review the results by executing the following statement: nytprofhtml --open (while still in the same directory). You should get a browser window with more useful information than you can shake a stick at.

        My optimized regex is going to help as an optimization of the exact regex you provided. But it's tricky to implement and maintain as your needs continue to evolve. A better solution would be to use threads, or to fork processes. BrowserUk already had some suggestions on how you might implement such a strategy. The beauty of that sort of approach is that you don't have to concern yourself quite as much with how efficient the regular expressions themselves are because you're processing several files in parallel.

        If you end up with a ton of data every day that has to get chewed through before tomorrow, you might look into a Map-Reduce strategy such as with hadoop.


        Dave

Re: Help with speeding up regex
by Anonymous Monk on Aug 11, 2012 at 06:28 UTC
    um, you need more whitespace in that regex, you're already using /x, might as well make the regex readable :)
Re: Help with speeding up regex
by BrowserUk (Pope) on Aug 11, 2012 at 11:23 UTC

    A question? Has that regex ever matched?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

Re: Help with speeding up regex
by Kenosis (Priest) on Aug 11, 2012 at 17:08 UTC

    The OP says that it "...counts up false positive words..." so it sounds like some matching is occurring.

Re: Help with speeding up regex
by kcott (Abbot) on Aug 11, 2012 at 18:02 UTC

    All of those alternations are possibly causing the regex engine to do a lot of extra work due to backtracking.

    Consider changing the parts of your regexp that look like

    (?:a|b|c)

    to use independent subexpressions like

    (?>a|b|c)

    This is documented in perlre - Extended Patterns (look for (?>pattern) - it's a fair way down that section).

    You may also find Regexp::Debugger to be a useful tool for visualising what the regex engine is doing.

    -- Ken

Re: Help with speeding up regex
by davido (Archbishop) on Aug 12, 2012 at 09:04 UTC

    I decided to take a stab at optimizing the regular expression itself. Here's what I did. First, I backed out of your regular expression all the possible strings that would match exactly. Then I used those as the basis for new patterns that had no alternation at all. I sorted each pattern to get similarities grouped together. Then I categorized them by common keyword, and further subcategorized them by whether that keyword had additional text after it, or before it (or both). That provided the basis for new alternation logic.

    From here on out I did everything programatically: Sorted all of the alternations in order from greatest length to shortest length. Then sorted each of the groupings in order from longest to shortest, again forming another layer of alternations.

    The goal was to bite off the biggest possible chunks first, smaller later. And to never have more than two layers of alternation in any given potential match.

    Here's the resulting code (and I apologize right now for the fact that I haven't taken advantage of the /x modifier; the regexes were constructed programatically):

    And here's the benchmark output:

    87 87 Rate reold reopt reold 1318/s -- -75% reopt 5189/s 294% --

    ...and on a more beefy box...

    87 87 Rate reold reopt reold 2585/s -- -74% reopt 10114/s 291% --

    Both were run with Perl v5.16.1.

    It would be a much better test if I had your actual input data. Instead, I have all the possible ${^MATCH} terms (in the __DATA__ segment). I'm only verifying that all the possible matches are matching. You'll have to verify that rejections are occurring as they should, by testing against some of your real data.

    As you can see, we get almost a 3x performance improvement by optimizing the alternations. But the data set I'm testing with is highly contrived. The only way to really know how much this improves upon your existing solution is to benchmark it with real data. And I still believe there is value in knowing what other bottlenecks you might have, which can be done by profiling your actual code with a real dataset.


    Dave

Re: Help with speeding up regex
by eversuhoshin (Sexton) on Aug 12, 2012 at 16:41 UTC

    Hello Thank you all so much for the helpful suggestions. I will need some time to fully digest them since I am still learning perl :)

    Basically, my script identifies the number of false positive words related to management guidance. I need to do this so I don't have to go through all the financial filings. So through manual processing, I figured out words that seem to be related to guidance but do not have anything to do with the actual guidance. The regex code that I posted is that list that I compiled. By counting the number of false positive words I know that this filing is irrelevant and I will not have read it later for processing.

    I have changed the code a bit and used File::Map to speed it up but I am not sure if I am doing it right. Also, someone asked if the regex worked. Yes, regex works but it is slow and I am trying to make it faster.

    map_file my($data), $filing; $fcount=()=$data=~m/outlook\s+for\s+any\s+rating|(?:rating|if\s+on\ +s+negative|Microsoft|suggesting\s+an|may\s+contain\s+statements\s+abo +ut\s+future\s+events\,|business\s+conditions\s+and\s+the)\s+outlook|g +uidance\s+(?:to\s+approve|facility) |(?:authoritative|revenue\s+recognition|invaluable\s ++practical|valuable|regulatory|technical|under\s+the|staff\'s|judicia +l|SEC|FDA|Treasury(?:\s+Department)?|specific|implementation|their|go +vernment|any\s+ruling|college|absent|\s+his|interim|intrepretive|tran +sition|administrative|procedural|related|applicable|accounting|defini +tive|superceding|IRS|Internal\s+Revenue\s+Service|valued|EITF\s+accou +nting)\s+guidance |guidance\s+(?:and\s+rules|promulgated(?:\s+thereund +er)?|in\s+SFAS)|(?:provided|issued)\s+by\s+(?:the\s+)?(?:SEC|Securiti +es\s+and\s+Exchange\s+Commission|Internal\s+Revenue\s+Service|Secreta +ry|United\s+States|Financial\s+Accounting) |(?:other|applicable)\s+guidance\s+issued|according\ +s+to\s+the\s+guidance\s+contained|provide\s+guidance\s+to\s+directors +|receiving\s+guidance |(?:current|other)\s+guidance\s+(?:under|from)|assum +es\s+guidance\s+of\s+(?:the|a)\s+(?:company|board|talented\s+team|com +pensation)|guidance\s+(?:system|software|technology) /xig;

    I am also attaching some sample text

    http://sec.gov/Archives/edgar/data/1011737/0001193125-06-122041.txt

    http://sec.gov/Archives/edgar/data/1012270/0001104659-07-059430.txt

    http://sec.gov/Archives/edgar/data/1016281/0001104659-03-016871.txt

    http://sec.gov/Archives/edgar/data/1166036/0001104659-09-021080.txt

    http://sec.gov/Archives/edgar/data/1019361/0001019361-04-000007.txt

    http://sec.gov/Archives/edgar/data/1013934/0000950136-04-003588.txt

    Thank you all again for everything!

      1. If this is to eliminate false positives, why is it necessary to count all the negative hits?

        Doesn't the presence of just one false hit exclude a document?

        If so, the simplest optimisation might be remove the /g;

      2. Presumably, this is just one example of a generic problem?

        Otherwise, if you'd just left the regex running from the point where you posted your question, until you posted your follow-up, you would have processed a little under 1 million documents of the size of those you've linked.

      3. If it is a generic problem of optimising complex regexes, then you'll need a programmable solution.

        Whilst it may be possible to hand-optimise the supplied regex to cut runtime, you'd then be faced with having to do it all again for the next set of false matches.

      4. Perhaps the next simplest solution would be to multi-task your processing of the documents.

        Spread your load across the 4 cores of a typical current machine and you can cut your processing time to a 1/4.

        Purchase a $100 of Amazon's EC2 time and cut your processing time to 1/100th or less.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

        Dear Browser, Thank you for your kind reply. Can you tell me how I can do multi-task processing? that would be very very helpful! Thank you again!

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://986831]
Approved by davido
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others having an uproarious good time at the Monastery: (4)
As of 2014-11-26 04:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My preferred Perl binaries come from:














    Results (162 votes), past polls