PerlMonks  

Re: Help with speeding up regex

by eversuhoshin (Sexton)
on Aug 12, 2012 at 16:41 UTC (#986968)


in reply to Help with speeding up regex

Hello, and thank you all so much for the helpful suggestions. I will need some time to fully digest them since I am still learning Perl :)

Basically, my script counts false-positive phrases related to management guidance, so that I don't have to go through all the financial filings by hand. Through manual processing, I identified phrases that look as if they relate to guidance but have nothing to do with actual management guidance. The regex I posted is the list I compiled. By counting the false-positive phrases, I can tell that a filing is irrelevant and that I will not have to read it later.

I have changed the code a bit and used File::Map to speed it up, but I am not sure if I am doing it right. Also, someone asked if the regex worked. Yes, the regex works, but it is slow and I am trying to make it faster.

    map_file my ($data), $filing;
    $fcount = () = $data =~ m/
        outlook\s+for\s+any\s+rating
      | (?:rating|if\s+on\s+negative|Microsoft|suggesting\s+an
          |may\s+contain\s+statements\s+about\s+future\s+events\,
          |business\s+conditions\s+and\s+the)\s+outlook
      | guidance\s+(?:to\s+approve|facility)
      | (?:authoritative|revenue\s+recognition|invaluable\s+practical|valuable
          |regulatory|technical|under\s+the|staff\'s|judicial|SEC|FDA
          |Treasury(?:\s+Department)?|specific|implementation|their|government
          |any\s+ruling|college|absent|\s+his|interim|intrepretive|transition
          |administrative|procedural|related|applicable|accounting|definitive
          |superceding|IRS|Internal\s+Revenue\s+Service|valued
          |EITF\s+accounting)\s+guidance
      | guidance\s+(?:and\s+rules|promulgated(?:\s+thereunder)?|in\s+SFAS)
      | (?:provided|issued)\s+by\s+(?:the\s+)?
          (?:SEC|Securities\s+and\s+Exchange\s+Commission
            |Internal\s+Revenue\s+Service|Secretary|United\s+States
            |Financial\s+Accounting)
      | (?:other|applicable)\s+guidance\s+issued
      | according\s+to\s+the\s+guidance\s+contained
      | provide\s+guidance\s+to\s+directors
      | receiving\s+guidance
      | (?:current|other)\s+guidance\s+(?:under|from)
      | assumes\s+guidance\s+of\s+(?:the|a)\s+(?:company|board|talented\s+team|compensation)
      | guidance\s+(?:system|software|technology)
    /xig;
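
For readers unfamiliar with the `= () =` counting idiom used in the snippet above, here is a minimal sketch of how it behaves. The pattern is a hypothetical, much-shortened stand-in for the full list, the sample sentence is made up, and `count_hits` is just an illustrative name; File::Map is left out so the sketch needs only core Perl.

```perl
use strict;
use warnings;

# Hypothetical, much-shortened stand-in for the full pattern.
# qr// compiles it once; /x allows the multi-line layout, /i ignores case.
my $false_positive = qr/
    (?:authoritative|regulatory|accounting)\s+guidance
  | guidance\s+(?:system|software|technology)
  | (?:provided|issued)\s+by\s+(?:the\s+)?SEC
/xi;

# Count every non-overlapping match in a string.
# `() = ...` forces list context on the match; assigning that list
# to a scalar yields its element count, i.e. the number of matches.
sub count_hits {
    my ($text) = @_;
    my $count = () = $text =~ /$false_positive/g;
    return $count;
}

# Made-up sample text, not taken from a real filing.
my $sample = "Under accounting guidance issued by the SEC, "
           . "the guidance software was replaced.";
print count_hits($sample), "\n";
```

Because none of the groups capture, each element of the list is one whole match, so the count is the number of false-positive phrases found.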

I am also attaching some sample text:

http://sec.gov/Archives/edgar/data/1011737/0001193125-06-122041.txt

http://sec.gov/Archives/edgar/data/1012270/0001104659-07-059430.txt

http://sec.gov/Archives/edgar/data/1016281/0001104659-03-016871.txt

http://sec.gov/Archives/edgar/data/1166036/0001104659-09-021080.txt

http://sec.gov/Archives/edgar/data/1019361/0001019361-04-000007.txt

http://sec.gov/Archives/edgar/data/1013934/0000950136-04-003588.txt

Thank you all again for everything!


Re^2: Help with speeding up regex
by BrowserUk (Pope) on Aug 13, 2012 at 00:08 UTC
    1. If this is to eliminate false positives, why is it necessary to count all the negative hits?

      Doesn't the presence of just one false hit exclude a document?

      If so, the simplest optimisation might be to remove the /g.

    2. Presumably, this is just one example of a generic problem?

      Otherwise, if you'd just left the regex running from the point where you posted your question, until you posted your follow-up, you would have processed a little under 1 million documents of the size of those you've linked.

    3. If it is a generic problem of optimising complex regexes, then you'll need a programmable solution.

      Whilst it may be possible to hand-optimise the supplied regex to cut runtime, you'd then be faced with having to do it all again for the next set of false matches.

    4. Perhaps the next simplest solution would be to multi-task your processing of the documents.

      Spread your load across the 4 cores of a typical current machine and you can cut your processing time to 1/4.

      Purchase $100 of Amazon's EC2 time and cut your processing time to 1/100th or less.
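
Point 1 can be sketched roughly as follows, assuming a single hit really is enough to discard a filing. The shortened pattern and the names `is_irrelevant` and `has_at_least` are illustrative only and do not come from the original script.

```perl
use strict;
use warnings;

# Hypothetical shortened pattern; the real list would go here.
my $false_positive =
    qr/(?:regulatory|accounting)\s+guidance|guidance\s+software/i;

# Boolean test: without /g the regex engine stops at the first hit
# instead of scanning the rest of the document for further matches.
sub is_irrelevant {
    my ($text) = @_;
    return $text =~ $false_positive ? 1 : 0;
}

# Variant for when one hit is not enough: stop as soon as $threshold
# matches have been seen. In scalar context, /g resumes each search
# from where the previous match ended.
sub has_at_least {
    my ($text, $threshold) = @_;
    return 1 if $threshold <= 0;
    my $seen = 0;
    while ($text =~ /$false_positive/g) {
        return 1 if ++$seen >= $threshold;
    }
    return 0;
}
```

Either form saves time only on documents that do contain a hit; a document with no hits still has to be scanned to the end.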


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Dear Browser, thank you for your kind reply. Can you tell me how I can do multi-task processing? That would be very, very helpful! Thank you again!

        At its simplest, split your list of document filenames or URLs into (say) 4 files, then (concurrently) run 4 copies of your program, supplying a different filename to each.

        Beyond that, we'd need to see the structure of your current program before we could advise on ways of multi-tasking it.
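
The "split the list into 4" idea above might be sketched like this. Note the swap: rather than launching four separate copies of the program, this version forks workers from a single script, which is an assumption about what suits the (unseen) program. `partition`, `run_in_parallel` and the `$worker` callback are hypothetical names.

```perl
use strict;
use warnings;

# Deal the file list out round-robin into $n roughly equal chunks.
sub partition {
    my ($n, @files) = @_;
    my @chunks;
    push @chunks, [] for 1 .. $n;
    push @{ $chunks[ $_ % $n ] }, $files[$_] for 0 .. $#files;
    return @chunks;
}

# Fork one child per chunk; each child runs $worker->($file) on its share.
# $worker stands in for whatever the existing script does per filing.
sub run_in_parallel {
    my ($n, $worker, @files) = @_;
    my @pids;
    for my $chunk (partition($n, @files)) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {              # child process
            $worker->($_) for @$chunk;
            exit 0;
        }
        push @pids, $pid;             # parent remembers the child
    }
    waitpid($_, 0) for @pids;         # wait for all children to finish
    return scalar @pids;
}
```

A call would look something like `run_in_parallel(4, \&process_filing, @filenames)`, where `process_filing` is a placeholder for the per-document work. Children do not share variables with the parent, so each worker would need to write its counts to its own output file.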

