http://www.perlmonks.org?node_id=466491

Marcello has asked for the wisdom of the Perl Monks concerning the following question:

A large Perl application currently queries a certain MySQL table for records. As soon as records are found, they are processed by the application and DELETEd from the database. So basically the table acts as a queue.

This process needs to be as fast as possible. Therefore, the idea is to use multiple Perl processes (2 or more) to query this database and process the records.

The requirements are as follows:
  • Process the records as fast as possible
  • Avoid different processes handling the same record

Sometimes a process cannot handle records when a connection to a remote server is lost (which could take hours to resolve). At this point, the other processes should handle its records; otherwise the records would stay in the queue for too long.

I am not sure what kind of logic to implement to meet the requirements above.

For 2 processes, one could handle the records with odd IDs and the other the records with even IDs, but then the second requirement is still not met when one process cannot handle its records.

If processes SELECT the same records, there is significant overhead, because 50% or more of the selected records are already being handled by other processes and therefore should not be handled.

What is the best approach I can use to meet these requirements?

TIA!
    Re: Database queue logic
    by ruoso (Curate) on Jun 14, 2005 at 12:20 UTC

      I've dealt with a similar problem once...

      The key to the solution is that the $sth->rows method will return the number of rows affected by the last query-like statement. SELECT *and* UPDATE are query-like statements, so here it goes...

      You have to create a control field in the table with an int value and set its default value to some non-null value.

      Then you do the following...

      -- fetch 10 to avoid querying too many times...
      SELECT * FROM tablex ORDER BY dateofinsertion LIMIT 10;

      -- Mark the register as owned by you
      UPDATE tablex SET controlfield = controlfield + 1
        WHERE id = $id AND controlfield = $controlfieldvalue;

      When you execute the second query, you test the rows.

      if ($sth->rows) {
          # it did the update, so the row is mine
      }
      else {
          # someone else took the register, let's try another one
      }

      Simple, isn't it? If some process fails, the next time a process fetches the 10 registers it will get that row again and can try to process it.
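
      Tied together with DBI, a rough sketch might look like this (the DSN is a placeholder, and do()'s return value is used in place of $sth->rows since both report the affected row count; untested):

      use DBI;

      my $dbh = DBI->connect('dbi:mysql:database=queue', $user, $pass,
                             { RaiseError => 1 });

      # fetch a small batch of candidate rows
      my $rows = $dbh->selectall_arrayref(
          q{SELECT id, controlfield FROM tablex
            ORDER BY dateofinsertion LIMIT 10},
          { Slice => {} });

      for my $row (@$rows) {
          # try to claim the row; do() returns the number of rows updated
          my $claimed = $dbh->do(
              q{UPDATE tablex SET controlfield = controlfield + 1
                WHERE id = ? AND controlfield = ?},
              undef, $row->{id}, $row->{controlfield});
          next unless $claimed > 0;   # someone else took it, try the next one
          # the row is mine: process it, then DELETE it from the queue
      }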

      daniel
    Re: Database queue logic
    by trammell (Priest) on Jun 14, 2005 at 20:20 UTC
      I'd use something like the following:
      # pseudocode
      while (1) {
          acquire_lock();     # blocks
          choose_record();    # choose an unflagged record
          flag_record();      # this one is mine
          release_lock();     # let others in
          process_record();   # do work
      }

      MySQL has "advisory" locking that is suitable for this (see MySQL functions GET_LOCK() and RELEASE_LOCK()).

    Re: Database queue logic
    by monarch (Priest) on Jun 14, 2005 at 12:11 UTC

      Update 2: As Foxcub correctly points out below, it is futile to protect the SELECT with a MUTEX without wrapping the DELETE in the same call. Thus I am completely wrong; my logic is entirely flawed. Apologies.

      Update 3: Added readmore tags.

        This was an idea that crossed my mind, but it adds overhead because processes have to wait for the MUTEX to be released and cannot process records simultaneously.

        Marcello

          Update 2: As pointed out by Foxcub's reply to this post, I was wrong to ignore the fact that the DELETE also needs to occur within the same atomic action as the SELECT. Apologies.

          Update 3: added readmore tags.

    Re: Database queue logic
    by Solo (Deacon) on Jun 14, 2005 at 17:33 UTC
      I would have a single process watch and update the table, farming records out to worker processes--pre-forking if performance dictates.

      How fast do you want to check for new records? Rather than repeatedly querying the table, you might consider an insert trigger (not sure if that could work) or tailing a log file instead.

      --Solo

      --
      You said you wanted to be around when I made a mistake; well, this could be it, sweetheart.
    Re: Database queue logic
    by astroboy (Chaplain) on Jun 14, 2005 at 19:56 UTC

      I'd have a single master process monitoring the queue in a tight loop. As soon as there is a new record to deal with, the master process would spawn a worker to perform the desired tasks. Spawned processes can only deal with the record they have been allocated. If a spawned process cannot handle the task, it alerts the master process, which will reallocate the task as required.

      Of course, a lot of this comes down to how much resource you've got. You don't want to spawn more processes than the server can handle. You may wish to pre-create a finite pool of workers if resources are a problem.
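
      One way to bound the number of workers, sketched with Parallel::ForkManager (the module choice and the helper sub names are my own illustration, not the poster's):

      use Parallel::ForkManager;

      my $pm = Parallel::ForkManager->new(5);       # at most 5 concurrent workers

      while (my $record = next_queued_record()) {   # hypothetical: master polls the queue
          $pm->start and next;                      # parent: keep watching the queue
          process_record($record);                  # child: handle just this one record
          $pm->finish;                              # child exits
      }
      $pm->wait_all_children;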

    Re: Database queue logic
    by Ben Win Lue (Friar) on Jun 14, 2005 at 12:12 UTC
      I would add a timestamp column to the queue table.
      Any process would:
      • select all records that have a null timestamp or a timestamp older than 10* minutes
      • update all selected records with the timestamp of now
      • process all selected records
      • delete the processed records

      *the 10 minutes are subject to fiddling
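
      In rough DBI terms (table and column names invented, assuming an open $dbh), the steps above would look something like the following -- though note that the reply below points out the race between the SELECT and the UPDATE:

      # grab records nobody is working on, or whose claim has gone stale
      my $ids = $dbh->selectcol_arrayref(
          q{SELECT id FROM queue
             WHERE claimed_at IS NULL
                OR claimed_at < NOW() - INTERVAL 10 MINUTE});

      if (@$ids) {
          my $in = join ',', ('?') x @$ids;    # one placeholder per id
          $dbh->do(qq{UPDATE queue SET claimed_at = NOW() WHERE id IN ($in)},
                   undef, @$ids);
          # process the claimed records here, then remove them
          $dbh->do(qq{DELETE FROM queue WHERE id IN ($in)}, undef, @$ids);
      }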
        You're still racing between the select and the update, although something like this might work:

        • Update records where the 'owner' column is null, setting it to *my* value and the timestamp to now().
        • Select records with my owner value *or* where timestamp is older than threshold. This catches the case where some records have gone stale because a process died or what-have-you.
        • Process & delete selected records.
        This way you're letting the database server handle updating the correct records. Depending on how speedily new records are added to the queue, it seems likely that processes will be working with records more than waiting for selects. Of course this approach doesn't help with splitting the records evenly between processes, but neither will a mutex.
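
        A minimal DBI sketch of that claim-then-select pattern (the 'owner' and timestamp column names, the 10-minute threshold and the batch size are all illustrative, assuming an open $dbh):

        my $me = "$0-$$";    # some value unique to this worker process

        # claim a batch: the database decides atomically which rows become mine
        $dbh->do(
            q{UPDATE queue SET owner = ?, claimed_at = NOW()
               WHERE owner IS NULL
               LIMIT 10},
            undef, $me);

        # fetch my rows, plus any that have gone stale under a dead worker
        my $mine = $dbh->selectall_arrayref(
            q{SELECT * FROM queue
               WHERE owner = ?
                  OR claimed_at < NOW() - INTERVAL 10 MINUTE},
            { Slice => {} }, $me);

        for my $row (@$mine) {
            # process the record, then remove it from the queue
            $dbh->do(q{DELETE FROM queue WHERE id = ?}, undef, $row->{id});
        }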
          Good solution,

          I will try this approach shortly.

          Thanks all!
    Re: Database queue logic
    by thcsoft (Monk) on Jun 14, 2005 at 13:30 UTC
      Anyway, you will have to perform some kind of lock on the data. A mutex seems to be the correct approach to me - in a non-threaded environment.

      @Marcello
      What about threading? Is it a must for your concept to run in different processes? If, using threads, you take one shared scalar as a lock for each of the tables your threads act upon, your problem will possibly no longer be one.
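
      A bare-bones sketch of that idea with ithreads (the worker count, helper subs and single queue table are my assumptions; note that each thread would also need its own DBI connection, since handles cannot be shared across threads):

      use threads;
      use threads::shared;

      my $queue_lock :shared;    # one shared lock variable per table being worked on

      sub worker {
          while (1) {
              my $record;
              {
                  lock($queue_lock);               # only one thread claims at a time
                  $record = claim_next_record();   # hypothetical: SELECT and mark as mine
              }
              last unless defined $record;
              process_record($record);             # hypothetical: do the work, then DELETE
          }
      }

      my @workers = map { threads->create(\&worker) } 1 .. 4;
      $_->join for @workers;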

      language is a virus from outer space.