http://www.perlmonks.org?node_id=810655

Sometimes you have a script that needs to be run at regular intervals (weekly, daily, hourly, etc.). On Unix, this is usually accomplished with cron. However, if a cron job takes long enough to run, the next instance may begin before the current one completes. Depending on what the script does, this can create problems: consuming computer resources (CPU, memory, disk, bandwidth), corrupting database tables, or damaging files and directory structures.

In my case, I have a script that parses log files from several different locations (some at remote sites) and aggregates the results into a single database. There is a web app that allows users to view this data, issue queries, generate reports, etc. Parsing all of the log files can take several minutes, so doing it "live" every time someone accesses the web app is not feasible. Instead, I run my "database update" once per hour, using cron, and let the web app pull from the database. The disadvantage is that "new" data may not show up for up to 60 minutes. Originally, this was not a problem. But as more people have begun using the web app, there have been requests for more "dynamic updates". The easy fix was to simply run the cron job every 10 minutes. The update usually takes no more than 5 minutes, so this works. But occasionally a big update may take 15 minutes or more. Trust me when I tell you that running an update while another update is in progress causes Very Bad Things To Happen™.

My initial fix to avoid this was to create a "lock file" as the script began running, then delete it when it finished. I checked for the existence of the lock file before creating it, and if it was already there, I assumed that another job was running and simply exited.
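In rough outline, that first approach looked something like this (a simplified sketch, not the actual script; the lock file path is made up for illustration):

#!/usr/bin/perl
use strict;

my $lock_file = '/tmp/update.lock';    # made-up path

# if the lock file exists, assume another instance is still running
exit if -e $lock_file;

# otherwise, claim the lock by creating the file
open my $fh, '>', $lock_file or die "Can't create $lock_file: $!\n";
close $fh;

# ... do the real work here ...

# release the lock
unlink $lock_file or warn "Can't remove $lock_file: $!\n";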

This seemed to work, until we had an unexpected power outage (in the middle of an update, of course). Since the script failed to exit normally, it didn't delete the lock file. Several days later, users began to complain about missing data. When I investigated the cause, I discovered that the old lock file was causing all the updates to abort.

At that point, I decided to actually store the job's process id in the lock file. Then, if a job sees a lock file, it reads the old process id and checks whether there is actually a running process with that id. If so, it exits; otherwise, it assumes the file is left over from an interrupted job and continues on, cleaning up as necessary.
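The key piece is kill with a signal of 0, which doesn't actually deliver a signal but reports whether the process exists (and whether we're allowed to signal it). A sketch of the check, with illustrative names:

# read the pid recorded by the previous run
open my $fh, '<', $pid_file or die "Can't read $pid_file: $!\n";
my ($old_pid) = <$fh> =~ /(\d+)/;
close $fh;

if ( $old_pid && kill(0, $old_pid) ) {
    # that process is still alive, so another update is in progress
    exit;
}

# otherwise the previous run was interrupted; overwrite the stale
# lock file with our own pid ($$) and carry on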

Then another problem arose, completely unrelated except that it also needed a script run via cron, and bad things would happen if multiple instances ran at the same time. So I started thinking about how to abstract this "only run one instance via cron" functionality into a module, so that I could easily add it to any script with a single line (something like 'use Cron::AvoidMultipleRuns;', for example).

I have not been able to find anything that does this on CPAN, nor with Google, but I may be overlooking it. If anyone knows of anything like this, please let me know.

Otherwise, I'm posting my solution here. If you have any comments or suggestions for ways to improve it, please let me know.

Here's an example script, pretend that this is run via cron:

#!/usr/bin/perl
use strict;

# make it impossible to run this script more than once at a time
use Cron::AvoidMultipleRuns;

print "Running... pid = $$\n";

# pretend this script actually does something here...
sleep 20;

Here is the module abstracting out the "run only once" behavior:

package Cron::AvoidMultipleRuns;
use strict;

my $cleanup;

INIT {
    $cleanup = 0;

    # derive the pid file name from the script name, e.g. demo.pl -> demo.pid
    (my $pid_file = "$0.pid") =~ s/\.pl//;

    if ( -e $pid_file ) {
        open my $fh, '<', $pid_file
            or die "Can't open $pid_file for reading: $!\n";
        my $line = <$fh>;
        my (undef, $pid) = split(/\s+/, $line);
        if ($pid) {
            print "Found a pid file: pid = $pid.\n";

            # signal 0 just checks whether the process exists
            my $status = kill(0, $pid);
            if ($status) {
                print "old job is still running.\n";
                exit;
            }
            else {
                print "The old job is no longer running.\n";
            }
        }
    }

    # record our own pid
    open my $fh, '>', $pid_file
        or die "Can't open $pid_file for writing: $!\n";
    flock($fh, 2)    # 2 is LOCK_EX
        or die "can't obtain exclusive lock on file $pid_file: $!\n";
    print $fh "pid: ", $$, "\n";
    close $fh;

    $cleanup = 1;
}

END {
    (my $pid_file = "$0.pid") =~ s/\.pl//;

    # only remove the pid file if this process actually created it
    if ( -e $pid_file && $cleanup ) {
        unlink $pid_file or die "can't unlink $pid_file : $!\n";
    }
}

1;

Here's how it works: if the script is called 'demo.pl', then when it runs, a file called 'demo.pid' is created containing a line like "pid: 12345", where '12345' is the process id number. When the job completes, 'demo.pid' is deleted. If you try running demo.pl again in another terminal while the first job is still running, it should see the pid file and exit immediately. If you create a bogus pid file and then run demo.pl, it should see that no process with that id is actually running, and go ahead and create a new pid file with the current, valid process id.

I originally used BEGIN and END, but I didn't like that 'perl -c demo.pl' actually caused a pid file to be created and deleted. So I changed BEGIN to INIT (to avoid the creation), then introduced the $cleanup variable to avoid the attempted deletion. It works, but seems a little kludgy. My other concern is whether this module would have some kind of unwanted interaction with other modules that also use INIT and END blocks.
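If you want to exercise the stale-pid-file path by hand, something like this should do it (it assumes the demo script is called demo.pl, as above, and that 999999 is not a live pid on your system):

#!/usr/bin/perl
use strict;

# write a bogus pid file next to demo.pl ...
open my $fh, '>', 'demo.pid' or die "Can't write demo.pid: $!\n";
print $fh "pid: 999999\n";
close $fh;

# ... then run the demo; it should report that the old job is no
# longer running, and replace demo.pid with its own pid
system($^X, 'demo.pl');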