I think kcott has a fine post on how to use alarm(). However, it seems to me that a better solution would be to figure out why this thing is so darn slow and fix that, so that you get a result every time without having to "give up".

I looked at the first couple of regexes (see below). When you use the /x modifier, you can space a regex out over multiple lines, which can improve readability a lot. You can also add comments to the lines. There are some limitations on what can go in a #comment (see the perlre doc for details), and you cannot put a space inside a 2-char token like the ?: in (?: ..the non-capture..), but this #comment stuff can be useful.
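To make the /x idea concrete, here is a minimal sketch. The pattern and the item_number() helper are made up for illustration; they are not taken from the original script.

```perl
use strict;
use warnings;

# Hypothetical pattern, not one from the original script: matches a
# heading such as "Item 1." or "ITEM 1A:" and captures the item number.
my $item_heading = qr{
    \b (?:Item|ITEM) [Ss]?   # "Item", "ITEM", "Items", ...
    \s*
    ( \d+ [A-Z]? )           # capture the item number, e.g. "1" or "1A"
    \s*
    [.:,-]?                  # optional punctuation after the number
}x;

# Return the captured item number, or undef when there is no heading.
sub item_number {
    my ($text) = @_;
    return $text =~ $item_heading ? $1 : undef;
}

print item_number('ITEM 1A: Risk Factors'), "\n";
```

Note that the comments stay clear of the closing delimiter and of $ and @, since those are still significant to Perl inside a qr{...}x pattern.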

I see some strange things (there appear to be terms that have no purpose). Also, $data (maybe 10MB) is slurped into memory as a single variable and many regexes are applied serially to this humongous thing. Parsing, re-parsing and re-parsing something big is often not a good idea performance-wise.

Often, parsing something very large is best done line by line and ONLY once. Read a line, deal with it, throw it away because we are done....
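A minimal sketch of the "read a line, deal with it, throw it away" loop. The count_matching_lines() helper and the sample text are hypothetical, and the demo reads from an in-memory filehandle instead of a real file.

```perl
use strict;
use warnings;

# Count the lines matching $re while holding only one line in memory.
sub count_matching_lines {
    my ($fh, $re) = @_;
    my $count = 0;
    while (my $line = <$fh>) {      # read a line ...
        $count++ if $line =~ $re;   # ... deal with it, then let it go
    }
    return $count;
}

# Demo on an in-memory filehandle; a real script would open the file.
my $sample = "Item 1. Business\nsome text\nItem 1A. Risk Factors\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print count_matching_lines($fh, qr/^\s*Item\b/i), "\n";
close $fh;
```

Memory use stays at roughly one line no matter how large the file is, and each line is looked at exactly once.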

I suspect that if you shared some more details about the file format, and why one of these things is 10MB, far more efficient algorithms could be devised. Your regexes appear to do very similar things. A single pass that figures everything out in one go would be faster. It could even be that an algorithm that just stops reading the file once we've got what we need is appropriate.
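Since the real file format is not known, the following is only a sketch of what a single pass with an early exit might look like. The start and end markers, and the extract_section() helper, are guesses loosely modelled on the "Item 1 ... Item 1A/2" idea in the regexes; they are assumptions, not the actual format.

```perl
use strict;
use warnings;

# Hypothetical markers; the real headings may well look different.
my $start_re = qr/^\s*Item\s+1\b[.:,-]?\s*(?:Description\s+of\s+)?Business\b/i;
my $end_re   = qr/^\s*Item\s+(?:1A|1B|2)\b/i;

# Return the text between the start and end markers, reading each line
# once and stopping as soon as the end marker is seen.
sub extract_section {
    my ($fh) = @_;
    my ($in_section, @lines) = (0);
    while (my $line = <$fh>) {
        if (!$in_section) {
            $in_section = 1 if $line =~ $start_re;
            next;
        }
        last if $line =~ $end_re;   # done: the rest of the file is never read
        push @lines, $line;
    }
    return join '', @lines;
}

my $sample = join "\n",
    'Table of Contents',
    'Item 1. Business',
    'We make widgets.',
    'We sell them too.',
    'Item 1A. Risk Factors',
    'Widgets may break.',
    '';
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print extract_section($fh);
close $fh;
```

If the start marker never shows up, this returns an empty string; if the end marker never shows up, it returns everything after the start marker. Whether headings really sit on lines of their own is exactly the kind of detail that would need to be clarified.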

While I was playing with this, I spaced your regexes out (that is what /x allows). I also show how to use the YAPE::Regex::Explain module, which is sometimes useful.

So one alternate way to space out the regexes to increase readability is shown below.

I do suspect that the "real solution" is to make this so fast that there is never any need for a 2 minute timeout! But there are some things about your application that I and others just don't understand. It would be most helpful if you could clarify further!

    #!/usr/bin/perl -w
    use strict;
    use YAPE::Regex::Explain;

    #prototype from the docs...
    #my $exp = YAPE::Regex::Explain->new($REx)->explain;

    my $REx1 = qr{(?:Item|ITEM)[Ss]?\s?
                  (?:\.|\-|:|\-\-|\,)?\s?
                  (?:1|I)\s?
                  (?:\.|\-|\:|\-\-|\,)?\s?
                  (?:Description|DESCRIPTION)?\s?
                  (?:[Oo][Ff])?\s?(?:[Tt][Hh][Ee])?\s?
                  (?:Busine\s?ss|BUSINE\s?SS|Company|COMPANY)\s?
                  (?:\.|\-|:|\-\-|\,|\()?
                  (.*?)\s?
                  (?:Item|ITEM)[Ss]?\s?
                  (?:\.|\:|\-\-|\-|\,)?\s?
                  (?:I|1A|1B|2)\s?
                  (?:\.|\:|\-\-|\-|\,)?}x;  #this term apparently not needed:
                                            #not captured and it is optional

    my $Rex2 = qr{(?:Business\s?Development|BUSINESS\s?DEVELOPMENT)\s?
                  (.*?)\s?
                  (?:Item|ITEM)[Ss]?\s?
                  (?:\.|\-|:|\-\-|\,)?\s?
                  (?:I|1A|1B|2)\s?
                  (?:\.|\:|\-\-|\-|\,)?}x;

    my $Rex3 = qr{(?:PART|Part)\s?
                  (?:\.|\-|\:|\-\-|\,)?\s?
                  (?:I|1)\s?
                  (?:\.|\-|:|\-\-|\,)?\s?
                  (?:BUSINESS|Business|GENERAL|general)
                  (.*?)\s?
                  (?:Item|ITEM)[Ss]?\s?
                  (?:I|1A|1B|2|3)\s?
                  (?:\:|\-|\,|\-\-|\.|\,)?}x;

    print YAPE::Regex::Explain->new($REx1)->explain;
Update: here is an example of an inefficiency:
if($data!~m/table\s?of\s?contents?|\sindex\spart\s(?:1|I)/i){
It's going to search the whole 10MB to figure out that this match does not exist. I suspect that there is a far faster way to do this job. Maybe that's not possible, but I doubt it. I think you should be asking the Monks how to make your algorithm run so darn fast that this 2 minute timeout is irrelevant. I would not be surprised if the total time to get all results could be 5-10x faster. Without knowing more I certainly can't guarantee that, but if I were in Vegas, I would put some money down on that proposition. But you have to explain more; not enough information is known.
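One possible way to avoid scanning all 10MB for a negative result, sketched under a big assumption: that a table of contents, if there is one, sits near the top of the file. The has_toc_near_top() helper and the cut-off value are hypothetical; only the regex itself comes from the line quoted above.

```perl
use strict;
use warnings;

# ASSUMPTION: a table of contents, if present, appears near the top.
# $limit (in characters) is a hypothetical cut-off, not a known value.
sub has_toc_near_top {
    my ($data_ref, $limit) = @_;
    my $head = substr($$data_ref, 0, $limit);   # copy only the first $limit characters
    return $head =~ m/table\s?of\s?contents?|\sindex\spart\s(?:1|I)/i ? 1 : 0;
}

my $data = "Annual Report\nTable of Contents\n" . ("filler text\n" x 1000);
print has_toc_near_top(\$data, 200) ? "has TOC\n" : "no TOC\n";
```

The trade-off is plain: a table of contents that starts after the cut-off is missed, so this only helps if the file format guarantees where such a thing can appear.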

In reply to Re: Help with timeout by Marshall
in thread Help with timeout by eversuhoshin
