Help with timeout

by eversuhoshin (Sexton)
on Sep 24, 2012 at 19:23 UTC (#995431)
eversuhoshin has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, I need help with timing out. I have a script that extracts excerpts from financial statements. The basic format is the following.

    foreach (@files){
        my $excerpt;
        my $match;
        my $start1 = time;   #measure time for each file
        my $duration1;
        my $data = slurp($_);

        if($data!~m/table\s?of\s?contents?|\sindex\spart\s(?:1|I)/i){

            if($data=~m/(?:Item|ITEM)[Ss]?\s?(?:\.|\-|:|\-\-|\,)?\s?(?:1|I)\s?(?:\.|\-|\:|\-\-|\,)?\s?
                        (?:Description|DESCRIPTION)?\s?(?:[Oo][Ff])?\s?(?:[Tt][Hh][Ee])?\s?
                        (?:Busine\s?ss|BUSINE\s?SS|Company|COMPANY)\s?(?:\.|\-|:|\-\-|\,|\()?
                        (.*?)\s?(?:Item|ITEM)[Ss]?\s?(?:\.|\:|\-\-|\-|\,)?\s?(?:I|1A|1B|2)\s?(?:\.|\:|\-\-|\-|\,)?/x){
                $excerpt=$1;
                $match=3;
                goto record;
            }

            if($data=~m/(?:Business\s?Development|BUSINESS\s?DEVELOPMENT)\s?
                        (.*?)\s?(?:Item|ITEM)[Ss]?\s?(?:\.|\-|:|\-\-|\,)?\s?(?:I|1A|1B|2)\s?(?:\.|\:|\-\-|\-|\,)?/x){
                $excerpt=$1;
                $match=4;
                goto record;
            }

            if($data=~m/(?:PART|Part)\s?(?:\.|\-|\:|\-\-|\,)?\s?(?:I|1)\s?(?:\.|\-|:|\-\-|\,)?\s?
                        (?:BUSINESS|Business|GENERAL|general)
                        (.*?)\s?(?:Item|ITEM)[Ss]?\s?(?:I|1A|1B|2|3)\s?(?:\:|\-|\,|\-\-|\.|\,)?/x){
                $excerpt=$1;
                $match=5;
                goto record;
            }

            if($data=~m/(?:Item|ITEM)?[Ss]?\s?1\s?(?:\.|\-|\:|\-\-|\,)?(?:\.|\-|\:|\-\-|\,)?\s?(?:1A)?\s?
                        (?:\.|\-|\:|\-\-|\,)?\s?(?:AND|[Aa]nd|\&)\s?2\s?(?:\.|\:|\-|\-\-|\,)?\s?(?:\.|\-|\:|\-\-|\,)?\s?
                        (?:[Bb]usiness\s?\,\s?[Rr]isks?\s?[Ff]actors\s?(?:[Aa]nd|AND|\&)\s?[Pp]roperties|BUSINESS\s?\,\s?RISK\s?FACTORS?\s?AND\s?PROPERTIES)
                        (.*?)\s?(?:Item|ITEM)\s?[Ss]?\s?(?:\:|\-|\,|\-\-|\.)?\s?(?:1A|1B|2|I|3)\s?(?:\.|\:|\-|\-\-|\,)?/x){
                $excerpt=$1;
                $match=7;
                goto record;
            }

            if($data=~m/(?:Item|ITEM)[Ss]?\s?(?:\.|\-|:|\-\-|\,)?\s?(?:1|I)\s?(?:\.|\-|:|\-\-|,)?\s?
                        (?:BUSINESS|[Bb]usiness)\s?(?:\.|\-|\:|\,)?
                        (.*?)\s?(?:Item|ITEM)\s?(?:\.|\:|\-|\-\-|\,)?\s?(?:I|1A|1B|2)\s?(?:\.|\:|\-|\-\-|\,)?/x){
                $match=8;
                $excerpt=$1;
                goto record;
            }
        }

        record:
        if(defined($excerpt)){
            $excerpt=~s/\s{2,}|\.\s|\"|\(|\)|\,|\'|\r|\n/ /g;
            $excerpt=~s/^\s+|\s+$//;   #trim
            $duration1 = ceil((time - $start1)/60);   #measure execution time for each file until printing
            print "$match \n $duration1 \n $excerpt \n";
        }
    }

A lot of the time the code gets stuck because it takes a long time to extract the excerpt, as some files are over 10MB. I want to use a timeout to move on to the next file if one takes more than two minutes. I have looked up Alarm but I don't know how to incorporate it into my code. It would be great if you could help me use a timeout in my loop in case it gets stuck.

Re: Help with timeout
by Anonymous Monk on Sep 25, 2012 at 00:36 UTC
    What does the file format look like?

      The end result is just a text file with the file name and excerpt. Thank you for asking :) Cheers, Pureum

        But what is the input file format? A 10MB file is nothing for Perl, so if you could share a sample file (one line is enough) we could look at the regexes. I guess those can be optimized in some way.
Re: Help with timeout
by kcott (Abbot) on Sep 25, 2012 at 04:54 UTC

    G'day eversuhoshin,

    "I have looked up Alarm but I don't know how to incorporate it into my code."

    Using the same code structure given in the alarm example, here's how you might do this:

        #!/usr/bin/env perl

        use strict;
        use warnings;

        my @iterations = map { 10 ** $_ } 0 .. 10;
        my $timeout = 2;

        for my $iterations_this_loop (@iterations) {
            eval {
                local $SIG{ALRM} = sub { die "TIMEOUT: $iterations_this_loop\n" };
                alarm $timeout;

                for (0 .. $iterations_this_loop) {
                    # Processing here
                }

                alarm 0;
            };

            if ($@) {
                die $@ unless $@ =~ /^TIMEOUT: \d+/;   # propagate unexpected errors
                print $@;
                next;
            }
            else {
                print "ENOUGH TIME: $iterations_this_loop\n";
            }
        }

    Output:

        $ pm_long_for_alarm.pl
        ENOUGH TIME: 1
        ENOUGH TIME: 10
        ENOUGH TIME: 100
        ENOUGH TIME: 1000
        ENOUGH TIME: 10000
        ENOUGH TIME: 100000
        ENOUGH TIME: 1000000
        ENOUGH TIME: 10000000
        TIMEOUT: 100000000
        TIMEOUT: 1000000000
        TIMEOUT: 10000000000

    for my $iterations_this_loop (@iterations) { stands in for your loop over the files (that will be foreach (@files){ in your code); each value represents the number of records in one file.

    for (0 .. $iterations_this_loop) { represents processing that number of records (that will be the processing you are doing within each iteration of your loop).

    Note: the argument to alarm is in seconds - for 2 minutes you'll need my $timeout = 120;.
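As a minimal sketch of how the pattern above might be packaged for a per-file loop, the eval/alarm block can be moved into a small helper. The name with_timeout and its return convention are assumptions made for illustration; they are not from the thread or from any module.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code, giving up after $seconds.
# Returns (1, @results) when $code finishes in time and (0) on timeout.
sub with_timeout {
    my ($seconds, $code) = @_;
    my @result;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "TIMEOUT\n" };
        alarm $seconds;
        @result = $code->();
        alarm 0;
        1;
    };
    alarm 0;                              # never leave an alarm pending
    return (1, @result) if $ok;
    die $@ unless $@ eq "TIMEOUT\n";      # propagate unexpected errors
    return (0);
}
```

Inside foreach (@files) this might be called as my ($done, $excerpt) = with_timeout(120, sub { extract_excerpt($data) }); followed by next unless $done;, where extract_excerpt is a hypothetical sub wrapping the regex cascade.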

    -- Ken

Re: Help with timeout
by Marshall (Prior) on Sep 25, 2012 at 19:37 UTC
    I think kcott has a fine post on how to use alarm(). However, it seems to me that a better solution would be to figure out why this thing is so darn slow and fix that, so that you get a result every time without having to "give up".
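One caveat is worth checking before relying on alarm for this particular job. perlipc notes that with Perl's deferred ("safe") signals, a signal that arrives during a long-running opcode, such as a regex match against a very large string, is not seen until that opcode completes. An alarm may therefore fail to interrupt a single slow match. A possible workaround, sketched below under that assumption, is to run the match in a child process that the parent can kill. The helper name run_in_child is hypothetical.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run $code in a child process and return the string
# it produced. Returns nothing (undef) on timeout or if the child failed.
sub run_in_child {
    my ($seconds, $code) = @_;

    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {                          # child: do the work
        close $reader;
        my $out = eval { $code->() };
        print {$writer} $out if defined $out;
        close $writer;
        POSIX::_exit(defined $out ? 0 : 1);   # skip END blocks and destructors
    }

    close $writer;                            # parent: wait for the result
    my $data;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "TIMEOUT\n" };
        alarm $seconds;
        local $/;                             # slurp everything the child writes
        $data = <$reader>;
        alarm 0;
        1;
    };
    alarm 0;
    kill 'KILL', $pid unless $ok;             # stop a child that is still busy
    waitpid $pid, 0;
    close $reader;

    return unless $ok && $? == 0;
    return $data;
}
```

The cost is one fork per file and passing the excerpt back through a pipe, which is usually small next to a match that runs for minutes.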

    I looked at the first couple of regexes (see below). When you use the /x modifier, you can space the regex out over multiple lines, which can improve readability a lot. You can also add comments to the lines, although there are some limitations on what can go in a #comment (see the perlre doc for details), and you cannot put a space inside a two-character token like the ?: in (?: ..the non-capture.. ). Still, this #comment stuff can be useful.
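To illustrate the /x layout with comments, here is a small sketch. The pattern is a simplified stand-in for the "Item 1. Business" heading, not one of the original regexes, and the variable name is made up for the example.

```perl
use strict;
use warnings;

# Simplified, illustrative heading pattern. Under /x, whitespace and
# #comments inside the pattern are ignored by the regex engine.
my $item1_heading = qr{
    (?:Item|ITEM) [Ss]? \s?     # "Item" or "Items"
    [.\-:,]* \s?                # optional punctuation
    (?:1|I) \s?                 # item number, arabic or roman
    [.\-:,]* \s?                # optional punctuation again
    (?:Business|BUSINESS)       # section title
}x;

print "heading found\n" if "Item 1. Business" =~ $item1_heading;
```

One of the limitations mentioned above: a comment inside the pattern must not contain the closing delimiter (here an unbalanced brace), because Perl finds the end of the pattern before it looks at the comments.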

    I see some strange things (there appear to be terms that have no purpose). Also, $data (maybe 10MB) is slurped into memory as a single variable, and many regexes are applied serially to this humongous thing. Parsing, re-parsing, and re-parsing something big is often not a good idea performance-wise.
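Before restructuring anything, it may help to measure which of the serial matches actually eats the time. The sketch below times each named pattern against the same text using the core Time::HiRes module. The helper name time_patterns and the shape of its return value are assumptions for illustration.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: run each named pattern once against $text and
# report whether it matched and how long the attempt took.
sub time_patterns {
    my ($text, %patterns) = @_;
    my %report;
    for my $name (sort keys %patterns) {
        my $start   = time;
        my $matched = $text =~ $patterns{$name} ? 1 : 0;
        $report{$name} = { matched => $matched, seconds => time - $start };
    }
    return %report;
}
```

Calling it with the five original regexes as qr// objects on one of the slow files should show whether a single pattern dominates, which is useful to know before rewriting all of them.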

    Often, parsing something very large is best done line by line and ONLY once. Read a line, deal with it, throw it away because we are done...
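A minimal sketch of that line-by-line idea follows. It assumes, without knowing the real file format, that the section starts on a line beginning with something like "Item 1. Business" and ends at a line beginning with "Item 1A", "Item 1B" or "Item 2". Both patterns and the name extract_section are illustrative guesses.

```perl
use strict;
use warnings;

# Assumed heading shapes; the real filings may well differ.
my $START = qr/^\s*Item\s*1\s*[.:\-]*\s*(?:Business|Company)\b/i;
my $END   = qr/^\s*Item\s*(?:1A|1B|2)\b/i;

# Read the handle once, keep only the lines between the two headings,
# and stop reading as soon as the closing heading is seen.
sub extract_section {
    my ($fh) = @_;
    my ($inside, @lines);
    while (my $line = <$fh>) {
        if (!$inside) {
            $inside = 1 if $line =~ $START;   # heading line itself is skipped
            next;
        }
        last if $line =~ $END;
        push @lines, $line;
    }
    return unless $inside;                    # start heading never found
    return join '', @lines;
}
```

Each line is tested against at most two short anchored patterns, so the cost grows with the size of the file rather than with backtracking over a 10MB string. If the closing heading never appears, this sketch returns everything up to end of file.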

    I suspect that if you shared some more details about the file format, and why one of these things is 10MB, far more efficient algorithms could be devised. Your regexes appear to do very similar things. A single pass that figures everything out in "one go" would be faster. It could even be that an algorithm that just stops reading the file once we've got what we need is appropriate.

    While I was playing with this, I spaced your regexes out (that is what /x allows). I also show how to use the YAPE::Regex::Explain module, which is sometimes useful.

    So an alternate way to space out the regexes to increase readability is shown below.

    I do suspect that the "real solution" is to make this so fast that there is never any need for a 2 minute timeout! But there are some things about your application that I and others just don't understand. It would be most helpful if you could clarify further!

        #!/usr/bin/perl -w
        use strict;
        use YAPE::Regex::Explain;

        #prototype from the docs...
        #my $exp = YAPE::Regex::Explain->new($REx)->explain;

        my $REx1 = qr{(?:Item|ITEM)[Ss]?\s?
                      (?:\.|\-|:|\-\-|\,)?\s?
                      (?:1|I)\s?
                      (?:\.|\-|\:|\-\-|\,)?\s?
                      (?:Description|DESCRIPTION)?\s?
                      (?:[Oo][Ff])?\s?(?:[Tt][Hh][Ee])?\s?
                      (?:Busine\s?ss|BUSINE\s?SS|Company|COMPANY)\s?
                      (?:\.|\-|:|\-\-|\,|\()?
                      (.*?)\s?
                      (?:Item|ITEM)[Ss]?\s?
                      (?:\.|\:|\-\-|\-|\,)?\s?
                      (?:I|1A|1B|2)\s?
                      (?:\.|\:|\-\-|\-|\,)?
                     }x;   #this last term apparently
                           # not needed: not captured and it is optional

        my $Rex2 = qr{(?:Business\s?Development|BUSINESS\s?DEVELOPMENT)\s?
                      (.*?)\s?
                      (?:Item|ITEM)[Ss]?\s?
                      (?:\.|\-|:|\-\-|\,)?\s?
                      (?:I|1A|1B|2)\s?
                      (?:\.|\:|\-\-|\-|\,)?
                     }x;

        my $Rex3 = qr{(?:PART|Part)\s?
                      (?:\.|\-|\:|\-\-|\,)?\s?
                      (?:I|1)\s?
                      (?:\.|\-|:|\-\-|\,)?\s?
                      (?:BUSINESS|Business|GENERAL|general)
                      (.*?)\s?
                      (?:Item|ITEM)[Ss]?\s?
                      (?:I|1A|1B|2|3)\s?
                      (?:\:|\-|\,|\-\-|\.|\,)?
                     }x;

        print YAPE::Regex::Explain->new($REx1)->explain;

    Update: here is an example of an inefficiency:
    if($data!~m/table\s?of\s?contents?|\sindex\spart\s(?:1|I)/i){
    It's going to search the whole 10MB to figure out that this match does not exist. I suspect that there is a far faster way to do this job. Maybe that's not possible, but I doubt it. I think you should be asking the Monks how to make your algorithm run so darn fast that this 2 minute timeout is irrelevant. I would not be surprised if the total time to get all results ended up 5-10x faster. Without knowing more I certainly can't guarantee that, but if I were in Vegas, I would put some money down on that proposition. You have to explain more, though: not enough information is known.
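One way to bound that scan, sketched here under a loud assumption, is to test only the beginning of the file. The assumption is that a table of contents, when present, appears within the first $limit characters. That changes the meaning of the original check and would need verifying against real filings. The helper name and the default limit are made up for the example.

```perl
use strict;
use warnings;

# ASSUMPTION: any table of contents sits near the start of the filing.
# Only the first $limit characters are scanned, so a non-match costs a
# bounded amount of work instead of a pass over the whole 10MB.
sub has_toc_near_start {
    my ($data, $limit) = @_;
    $limit = 200_000 unless defined $limit;   # arbitrary illustrative default
    my $head = substr($data, 0, $limit);
    return $head =~ m/table\s?of\s?contents?|\sindex\spart\s(?:1|I)/i ? 1 : 0;
}
```

In the original loop the first test would then read if (!has_toc_near_start($data)) { ... }.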
