PerlMonks  

Searching large files before browser timeout

by aijin (Monk)
on Jun 13, 2001 at 00:03 UTC ( #87921=perlquestion )
aijin has asked for the wisdom of the Perl Monks concerning the following question:

I've been working on a cgi tool that takes a file and searches for a particular pair of strings and then optionally sorts the results, eliminating any duplicates. I thought I had everything working fine until someone tried to run an 80MB file through it, and his browser timed out before the searching was done.

I'm looking for suggestions on how best to approach this problem. It seems that this fellow isn't an isolated case and more people are going to need this tool to search files of this size, or even larger.

I know I could just read x lines at a time and display the results for just that section, but this presents a problem if the user requests that the results be sorted and that all duplicates are removed. The tool currently provides a count of the matches as well, something I could not easily provide with this method.

Any suggestions?

Re: Searching large files before browser timeout
by WrongWay (Pilgrim) on Jun 13, 2001 at 00:21 UTC
    Without seeing any code I would say you have 2 options.

    1. Build a queueing system, where all the work is done by a Perl cron job, and the user can keep refreshing a queue status page until his/her job is completed.

    2. Pre-split/sort your file(s). This should allow quicker searches. 80MB is pretty hefty; maybe forty 2MB files would be better.
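A minimal sketch of the pre-splitting idea in item 2, assuming you chunk by line count so no record is cut in half; the `split_file` name and the chunk-naming scheme are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a large file into numbered chunks of $lines_per_chunk lines each.
# Returns the list of chunk file names it created.
sub split_file {
    my ($file, $lines_per_chunk) = @_;
    open my $in, '<', $file or die "can't open $file: $!";
    my ($n, $out) = (0, undef);
    my @chunks;
    while (my $line = <$in>) {
        if (($. - 1) % $lines_per_chunk == 0) {   # start a new chunk
            close $out if $out;
            my $name = sprintf '%s.%03d', $file, ++$n;
            open $out, '>', $name or die "can't write $name: $!";
            push @chunks, $name;
        }
        print $out $line;
    }
    close $out if $out;
    close $in;
    return @chunks;
}
```

Each chunk can then be searched (or pre-sorted) independently, which also makes it easy to show partial results early.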



    Just my $.02 worth.
    WrongWay
Re: Searching large files before browser timeout
by LD2 (Curate) on Jun 13, 2001 at 00:34 UTC
    The most common advice here at the Monastery is to check Super Search first, before posting a question. Here is a node that may help you...Browser Timeout
Re: Searching large files before browser timeout
by shotgunefx (Parson) on Jun 13, 2001 at 00:43 UTC
    I had a similar problem and there are a couple of ways of approaching it. I opted for turning off buffering, printing a "Please wait" message, and emitting a "." or similar every thousand or so records so they wouldn't hit refresh and spawn another copy of the process.
    Then when it was finished I displayed the results.
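The dot-printing approach above can be sketched like this, assuming the search is a simple per-line pattern match; `scan_with_progress`, the file name, and the pattern are placeholders, not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

$| = 1;    # turn off output buffering so the dots appear immediately

# Scan $file for lines containing $pattern, printing a progress dot
# every thousand records so the browser sees steady output.
sub scan_with_progress {
    my ($file, $pattern) = @_;
    my @matches;
    open my $fh, '<', $file or die "can't open $file: $!";
    print "Please wait";
    while (my $line = <$fh>) {
        push @matches, $line if $line =~ /\Q$pattern\E/;
        print '.' if $. % 1000 == 0;   # $. is the current line number
    }
    close $fh;
    print "\n";
    return @matches;
}
```

When the scan finishes, you print the (optionally sorted and de-duplicated) matches as the real results page.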

    You could also fork the actual search off as another process and display a "searching.." page with a meta-refresh or server-push, this way you could handle problems with impatient people spawning lots of processes.

    You might also be interested in this related node.

    -Lee

    "To be civilized is to deny one's nature."
      Very simple, very effective!
      And my choice for solving this problem on my own pages.
      BatGnat

      BALLOT: A multiple choice exam, in which all of the answers above are incorrect!
Re: Searching large files before browser timeout
by tachyon (Chancellor) on Jun 13, 2001 at 00:54 UTC

    One way to avoid the timeout would be to immediately send a results page to the browser. We cheat a bit. First generate a unique temporary page, temp12345.htm, where we will write our results when they become available. Next send a 307 Temporary Redirect header back to the browser that points to this temp12345.htm page. The temp page will then appear in the user's browser window. Some text like:

    We are processing your job, please click refresh
    now and again to see if your job is complete!
    

    will inform the user of what is happening.

    You are then free to process the job. All you need to do is write the result to your temp12345.htm page, and when the user next presses refresh - voila, the result. No timeouts. Until the results are written, the user just gets the same please-wait page with each refresh.

    For elegance you could add a meta refresh to reload the page every 10 seconds or so; then the user does not even need to worry about refreshing. When you write the result, you drop the auto-refresh so the final page does not keep reloading.
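A hedged sketch of this scheme, assuming a CGI context; the temp-page name (built from the process ID), the paths, and the wording are all illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the "please wait" page with a meta refresh so it reloads itself.
sub waiting_page {
    my ($refresh_secs) = @_;
    return <<"HTML";
<html><head><meta http-equiv="refresh" content="$refresh_secs"></head>
<body>We are processing your job; this page reloads every $refresh_secs seconds.</body></html>
HTML
}

my $temp = "temp$$.htm";    # unique per-process temp page name

# 1. Write the placeholder page where the results will eventually go.
open my $fh, '>', $temp or die "can't write $temp: $!";
print $fh waiting_page(10);
close $fh;

# 2. Redirect the browser to it and return immediately.
print "Status: 307 Temporary Redirect\r\n";
print "Location: /$temp\r\n\r\n";

# 3. ...the long search runs here; when done, overwrite $temp with the
#    real results (without the meta refresh) so the next reload shows them.
```

Overwriting the file atomically (write to a temp name, then rename) avoids the user catching a half-written results page.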

    merlyn explains the whole thing in detail (and with code) here

    Hope this helps

    tachyon

      You can specify a reload in the header ("client pull") rather than needing any client-side script.
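A minimal client-pull example: the `Refresh` header asks the browser to re-request the page after N seconds, with no client-side script. The helper name and URL are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a client-pull header: reload $url after $secs seconds.
sub refresh_header {
    my ($secs, $url) = @_;
    return "Refresh: $secs; url=$url\r\n" .
           "Content-type: text/html\r\n\r\n";
}

print refresh_header(10, '/temp12345.htm');
print "Still working; this page will reload itself...\n";
</imports>
```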

        Agree. Getting tired. And Sloppy. Corrected. Thanks. tachyon

        sleep(28800);
        From experience, I would be wary of relying on browser refresh (whether manual or not). Web page caching occurs in so many hops between host and client, and varies in behaviour between - not only - browser type, but also browser version.

        Expect some visitors to experience unwanted behaviour from a setup like this. Unfortunately I know of no way to avoid it, other than not to rely on browser refresh.

        --
        Graq

Re: Searching large files before browser timeout
by John M. Dlugosz (Monsignor) on Jun 13, 2001 at 00:55 UTC
    Use "server push" technology to send a status report before the final answer. The very existence of that feature should stop the browser from timing out.
      I don't believe that "server push" is very portable. Pretty sure that IE doesn't support it anyway.

      $code or die
      $ perldoc perldoc
Re: Searching large files before browser timeout
by Davious (Sexton) on Jun 13, 2001 at 09:23 UTC
    We have an application where we needed to parse 100+ meg log files in real time for various strings and I had the same problem. I found that I was able to get a significant speed boost by offloading the string matching aspect to unix grep and piping the output into perl. It cut the time down from several minutes to under 30 seconds.
    my $cmd = qq|grep '$string' access.log|;
    open(LOG, "$cmd |") or die "can't run grep: $!";
    while (<LOG>) {
        # etc..
    }
    close(LOG);
      Having UNIX grep rather than Perl do the searching would usually slow your program down. Perl's matching is usually faster than UNIX's grep, but slower than UNIX's egrep. Of course, YMMV.
        Hmm, well in my case it was blindingly faster. Keep in mind I wasn't searching for anything more complicated than a fixed string (i.e. '127.0.0.1'), not a regexp or anything of that nature.
      This works great, thank you! I just benchmarked searching through a smallish file, using Perl pattern matching and grep.

      Benchmark: timing 10000000 iterations of Grep, Perl...
      Grep: 19 wallclock secs (16.38 usr + 0.03 sys = 16.41 CPU)
      Perl: 101 wallclock secs (80.91 usr + 8.11 sys = 89.02 CPU)

      What a difference!

Re: Searching large files before browser timeout
by mattr (Curate) on Jun 13, 2001 at 11:19 UTC
    One industrial-strength way is to fork off a child which does the processing while the parent keeps the browser from timing out by printing spaces, periods, or intermittent status messages. You need to have the parent set output autoflush on ($|=1). You also could do it without a child but use alarms, as mentioned in the timeout discussion mentioned above.
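The fork-and-keepalive idea above might be sketched like this; the child's `sleep` stands in for the real search, and the timings are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX ':sys_wait_h';    # for WNOHANG

$| = 1;    # parent must autoflush so the keepalive dots go out immediately

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: the long-running search goes here; it should write its
    # results somewhere the parent (or a results page) can find them.
    sleep 1;
    exit 0;
}

# Parent: poll the child without blocking, printing a dot each time
# so the browser connection stays alive.
while (waitpid($pid, WNOHANG) == 0) {
    print '.';
    select undef, undef, undef, 0.2;   # sub-second sleep
}
print "\ndone\n";
```

For richer status messages than dots, a pipe between child and parent (see perlipc) lets the child report real progress.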

    If you can get the child to send messages to the parent process during processing, those intermittent status messages could be more interesting. I am thinking of doing this for a similar problem we talked about recently at the Monastery... some message-passing pipe or possibly IPC may be useful for this.

    One idea, if you are going to have a ton of files to be processed maybe you want to have one server which just searches all these files, doing the optimization, scheduling, and sorting you need done, and have cgi processes talk to the server. That way you might be able to allot more cpu to the processing daemon. But you might get similar timeout issues.

Re: Searching large files before browser timeout
by Anonymous Monk on Jun 14, 2001 at 07:53 UTC
    stat the file
    if it is "too big" fork off a background process that will e-mail the results
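That two-step idea could look like this; the 10MB threshold, file name, and mail command are assumptions, not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Is $file larger than $limit bytes? (stat returns size at index 7.)
sub too_big {
    my ($file, $limit) = @_;
    my $size = (stat $file)[7];
    return defined $size && $size > $limit;
}

my $file  = 'access.log';
my $limit = 10 * 1024 * 1024;    # assumed 10 MB cutoff

if (too_big($file, $limit)) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Background child: run the search and mail the results, e.g.
        # system("grep '$string' $file | mail -s results user\@example.com");
        exit 0;
    }
    print "File is large; results will be e-mailed when ready.\n";
} else {
    # Small enough to search inline and return results directly.
}
```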
