
I think kcott has a fine post on how to use alarm(). However, it would seem to me that a better solution would be to figure out why this thing is so darn slow and fix that so that you get a result all of the time without having to "give up".

I looked at the first couple of regexes (see below). With the /x modifier you can spread a regex over multiple lines, which can improve readability a lot. You can also add comments to the lines, although there are some limitations on what can go in a #comment (see the perlre doc for details), and you cannot put a space inside a 2 char token like the ?: in (?: ..the non-capture..). Still, this #comment stuff can be useful.
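To make the /x point concrete, here is a small sketch. The pattern and the sub name item_number are made up for illustration and are not taken from your code. Note that the comments avoid the closing delimiter of the qr{} construct.

```perl
use strict;
use warnings;

# Hypothetical pattern, for illustration only: an "Item 1A" style heading.
my $heading = qr{
    (?:Item|ITEM) [Ss]? \s?   # "Item", "ITEM", "Items", ...
    [.:,-]? \s?               # optional punctuation after the word
    (1A|1B|1|2)               # capture the item number
}x;

# Return the captured item number, or undef if the text has no heading.
sub item_number {
    my ($text) = @_;
    return $text =~ $heading ? $1 : undef;
}

print item_number("ITEM 1A. Risk Factors"), "\n";   # prints 1A
```

The whitespace and the #comments inside the pattern are ignored by the regex engine because of the trailing x flag.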

I see some strange things (there appear to be terms that have no purpose). Also, $data (maybe 10MB) is slurped into memory as a single variable and many regexes are applied serially to this humongous thing. Parsing, re-parsing, re-parsing and re-parsing something big is often not a good idea performance-wise.

Often, parsing something very large is best done line by line and ONLY once. Read a line, deal with it, throw it away because we are done....
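A minimal sketch of the read-a-line, deal-with-it, throw-it-away idea. The heading pattern and the sub name count_item_headings are my own inventions, since I don't know your real file format. The demo reads from an in-memory filehandle; a real script would open the file on disk instead.

```perl
use strict;
use warnings;

# Count heading-like lines from an already opened filehandle.
# Only one line is held in memory at a time.
sub count_item_headings {
    my ($fh) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ /^\s*Item\s?\d/i;   # hypothetical pattern
    }
    return $count;
}

# Demo on an in-memory filehandle.
my $sample = "Item 1. Business\nWe make widgets.\nITEM 2. Properties\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
print count_item_headings($fh), "\n";   # prints 2
close $fh;
```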

I suspect that if you shared some more details about the file format, and about why one of these things is 10MB, far more efficient algorithms could be devised. Your regexes appear to do very similar things. A single pass that figures everything out in "one go" would be faster. It could even be that an algorithm that just stops reading the file once we've got what we need is appropriate.
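Here is a rough sketch of a single pass that stops reading as soon as it has what it needs. Both patterns are simplified stand-ins for your regexes, and it assumes the headings each sit on their own line, which may not be true for your data. The sub name extract_section is hypothetical.

```perl
use strict;
use warnings;

# Collect the lines between a start heading and the next item heading,
# then stop reading. Both patterns are simplified stand-ins.
sub extract_section {
    my ($fh) = @_;
    my $start = qr/^\s*Item\s?1\b.*Business/i;
    my $end   = qr/^\s*Item\s?(?:1A|1B|2)\b/i;
    my @section;
    my $in_section = 0;
    while (my $line = <$fh>) {
        if (!$in_section) {
            $in_section = 1 if $line =~ $start;
            next;                  # skip everything up to and including the start heading
        }
        last if $line =~ $end;     # got what we need, stop reading
        push @section, $line;
    }
    return join '', @section;
}

# Demo on an in-memory filehandle; the last line is never examined.
my $doc = "Cover page\nItem 1. Business\nWe make widgets.\n"
        . "Item 1A. Risk Factors\nNever read.\n";
open my $in, '<', \$doc or die "cannot open in-memory file: $!";
print extract_section($in);   # prints "We make widgets."
close $in;
```

If the section of interest is near the top of a 10MB file, the loop exits after reading only a small part of it.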

While I was playing with this, I spaced your regexes out (that is what /x allows). I also show how to use the YAPE::Regex::Explain module, which is sometimes useful.

One alternate way to space out the regexes for readability is shown below.

I do suspect that the "real solution" is to make this so fast that there is never any need for a 2 minute timeout! But there are some things about your application that I and others just don't understand. It would be most helpful if you could clarify further!

#!/usr/bin/perl -w
use strict;

use YAPE::Regex::Explain;

#prototype from the docs...
#my $exp = YAPE::Regex::Explain->new($REx)->explain;

# Note: qr{} takes its flags after the closing brace. Writing m/ and /x
# inside the braces would make them literal parts of the pattern.
my $REx1 =
    qr{(?:Item|ITEM)[Ss]?\s?
    (?:\.|\-|:|\-\-|\,)?\s?
    (?:1|I)\s?
    (?:\.|\-|\:|\-\-|\,)?\s?
    (?:Description|DESCRIPTION)?\s?
    (?:[Oo][Ff])?\s?(?:[Tt][Hh][Ee])?\s?
    (?:Busine\s?ss|BUSINE\s?SS|Company|COMPANY)\s?
    (?:\.|\-|:|\-\-|\,|\()?
    (.*?)\s?
    (?:Item|ITEM)[Ss]?\s?
    (?:\.|\:|\-\-|\-|\,)?\s?
    (?:I|1A|1B|2)\s?
    (?:\.|\:|\-\-|\-|\,)?  # this term is apparently not needed:
                           # it is optional and not captured
    }x;

my $REx2 =
    qr{(?:Business\s?Development|BUSINESS\s?DEVELOPMENT)\s?
    (.*?)\s?
    (?:Item|ITEM)[Ss]?\s?
    (?:\.|\-|:|\-\-|\,)?\s?
    (?:I|1A|1B|2)\s?
    (?:\.|\:|\-\-|\-|\,)?
    }x;

my $REx3 =
    qr{(?:PART|Part)\s?
    (?:\.|\-|\:|\-\-|\,)?\s?
    (?:I|1)\s?
    (?:\.|\-|:|\-\-|\,)?\s?
    (?:BUSINESS|Business|GENERAL|general)
    (.*?)\s?
    (?:Item|ITEM)[Ss]?\s?
    (?:I|1A|1B|2|3)\s?
    (?:\:|\-|\,|\-\-|\.|\,)?
    }x;

print YAPE::Regex::Explain->new($REx1)->explain;

Update: here is an example of an inefficiency:

if($data!~m/table\s?of\s?contents?|\sindex\spart\s(?:1|I)/i){

When there is no match, it's going to search the whole 10MB just to figure out that the match does not exist. I suspect that there is a far faster way to do this job. Maybe that's not possible, but I doubt it. I think you should be asking the Monks how to make your algorithm run so darn fast that this 2 minute timeout is irrelevant. I would not be surprised if the total time to get all results could be made 5-10x faster. Without knowing more I certainly can't guarantee that, but if I was in Vegas, I would put some money down on that proposition. But you have to explain more - not enough information is known.
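One possible way to cut that cost, sketched under a big assumption: that a table of contents, if there is one, sits near the top of the document. If that does not hold for your files, this changes the result and is not a drop-in replacement. The sub name has_toc_near_top and the limit values are hypothetical.

```perl
use strict;
use warnings;

# Test only the first $limit characters of the data.
# ASSUMPTION: a table of contents, if present, appears near the top.
sub has_toc_near_top {
    my ($data_ref, $limit) = @_;
    my $head = substr($$data_ref, 0, $limit);
    return $head =~ m/table\s?of\s?contents?|\sindex\spart\s(?:1|I)/i ? 1 : 0;
}

# Demo: the heading is in the first 200 characters of a larger string.
my $data = "TABLE OF CONTENTS\n" . ("filler text\n" x 1000);
print has_toc_near_top(\$data, 200) ? "found\n" : "not found\n";   # prints found
```

Passing a reference avoids copying the big string into the sub; only the short head is copied by substr.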

In reply to Re: Help with timeout by Marshall
in thread Help with timeout by eversuhoshin
