PerlMonks  

Looking for a resource management / job queue module

by elTriberium (Friar)
on Jul 20, 2012 at 20:55 UTC
elTriberium has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!

This question is not strictly Perl-specific, as I would be fine with a non-Perl solution too. However, since our whole setup is currently based on Perl, a Perl-based solution would be ideal.

We currently have an environment where we run automated test cases on a combination of Linux servers (S), Linux clients (L) and Windows clients (W). Each individual test might require zero, one or more of each of S, L and W. We have to guarantee that no other tests will run on the same systems at the same time.

Currently we have pre-allocated testbeds for several test runs (nightly regression tests and such), but that leaves these systems idle for a long time (a nightly run typically takes ~12 hours, so the systems are idle for the other 12 hours every day). Also, it's making it hard for users to just quickly run some tests as they need a dedicated testbed.

So what we're looking for is a system that can manage these (S, L, W) resources and reserve them for the individual test runs. It should also support a job queue, so that a job is queued when the resources it requires are not currently available.

Once the resources are available, it should "reserve" them (or just mark them as reserved in a database) and provide them to our in-house Perl-based tool that launches the tests on them (our tool connects via SSH and Telnet, so that part doesn't need to be included). Once our tool finishes, the system I'm looking for should mark the resources as "unreserved" and put them back into the resource pool.
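To make the reserve / unreserve cycle described above concrete, here is a minimal in-memory sketch. It is written in Python purely for illustration; the class name `ResourcePool`, the host names and the `needs` mapping are all hypothetical, and a real setup would persist this state somewhere (a database, lock files) rather than in a dict.

```python
class ResourcePool:
    """Tracks free and reserved hosts by kind: 'S', 'L' or 'W'."""

    def __init__(self, hosts):
        # hosts: mapping of host name -> kind, e.g. {"srv1": "S"}
        self.kind = dict(hosts)
        self.owner = {}  # host name -> job id currently holding it

    def reserve(self, job_id, needs):
        """Reserve needs[kind] free hosts of each kind, or nothing at all."""
        picked = []
        for kind, count in needs.items():
            free = sorted(h for h, k in self.kind.items()
                          if k == kind and h not in self.owner)
            if len(free) < count:
                return None  # not enough hosts: caller keeps the job queued
            picked.extend(free[:count])
        # Only mark hosts as reserved once every requirement is satisfied.
        for host in picked:
            self.owner[host] = job_id
        return picked

    def release(self, job_id):
        """Return every host held by job_id to the pool."""
        for host in [h for h, j in self.owner.items() if j == job_id]:
            del self.owner[host]
```

A caller would try `pool.reserve("nightly", {"S": 1, "L": 2})`, hand the returned host list to the test launcher, and call `pool.release("nightly")` when the run ends. A `None` result means the job stays in the queue.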

We already looked at many existing solutions, but most don't look completely sufficient:

  • TheSchwartz CPAN module: Doesn't seem to provide a resource reservation system
  • POE::Component::JobQueue: Looks like it only supports individual workers (clients), but not a combination of (S, L, W) as mentioned above
  • Condor and similar Grid management tools: Seem like overkill and also they expect to run the actual jobs on the individual nodes, which is not our use case

My question: is something like this already available? Is anybody doing something similar? I wouldn't expect it to be an uncommon environment, with multiple systems that need to be reserved in order to run tests on them. If there are multiple modules that I can combine (e.g. TheSchwartz and some resource manager), that would also be fine. I'd appreciate any help!

Re: Looking for a resource management / job queue module
by Marshall (Prior) on Jul 21, 2012 at 15:33 UTC
    This could get to be pretty complicated if you require a fancy scheduling algorithm - but it could be that something fairly simple would work well enough to get started.

    It sounds like you already have the concept of a central "cop" program that starts these various tests and you need a resource manager for it to keep track of the resources.

    You could use a DB to track resources, but one common way is to create a series of zero- or one-byte files, each file representing one of the resources. A resource counts as in use while the "cop" program holds an exclusive lock (write lock) on its file; if the lock cannot be acquired, something else is already using that resource. Release the lock when the test is over. If the "cop" program dies, all of its locks are released (a file lock is a memory-resident structure, not something on the disk). This way you don't have to clean up a DB on a restart.
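A minimal sketch of the lock-file idea, assuming a Unix-like system. It is written in Python for illustration (Perl's built-in `flock` offers the same exclusive, non-blocking mode); the helper names `try_lock` and `unlock` are made up for this example.

```python
import fcntl
import os

def try_lock(path):
    """Try to take an exclusive, non-blocking lock on path.

    Returns an open file descriptor on success, or None if another
    holder already has the lock. The lock lasts until the descriptor
    is closed or the process exits.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd

def unlock(fd):
    """Release the lock by closing the descriptor."""
    os.close(fd)
```

Because the lock lives with the open descriptor rather than in the file contents, a crashed "cop" process leaves nothing behind to clean up.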

    Your Perl program keeps a table of who is using what. The hard part is a case like this: test1 uses a couple of resources, test2 needs them all, and test3 uses a couple of the resources (different ones than test1). If you want test1 and test3 to run in parallel and then run test2, that requires "more smarts" than just running down the queue sequentially and waiting until resources are available for the next test. If the queue order were different (test1, test3, test2), then a simple algorithm would run 1 and 3 together and then run 2 once both 1 and 3 had finished. How "smart" the scheduler needs to be depends upon the job mix and other factors (like how important maximal efficiency is and how long these various tests run). Maybe some of the tests that only need a couple of resources run a long time and the one that needs them all is fast - I don't know.
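The difference between the two queue disciplines can be sketched as follows. This is an illustrative Python sketch, not a recommendation of either policy; the function names and the resource labels A to D are hypothetical. Each job is a `(name, set_of_resources)` pair.

```python
def strict_fifo(queue, busy):
    """Start jobs from the head of the queue; stop at the first that does not fit."""
    started = []
    busy = set(busy)  # work on a copy so the caller's set is untouched
    for name, needs in queue:
        if busy & needs:
            break  # head of the queue is blocked, so everything behind it waits
        busy |= needs
        started.append(name)
    return started

def first_fit(queue, busy):
    """Start every queued job whose resources are free, skipping blocked ones."""
    started = []
    busy = set(busy)
    for name, needs in queue:
        if not (busy & needs):
            busy |= needs
            started.append(name)
    return started
```

With test1 needing {A, B}, test2 needing {A, B, C, D} and test3 needing {C, D}, queued in that order, `strict_fifo` starts only test1, while `first_fit` starts test1 and test3 together. The price of skipping ahead is that the large job can wait indefinitely if small jobs keep arriving.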

    Sorry if this wasn't much help, but maybe you will get some ideas. You could "roll your own" simple manager and just see how well (or not well) it works out in practice. The job queue could just be a "drop directory" with files that describe the jobs. Try FIFO first and see how it works out. Increase complexity as needed.
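A "drop directory" queue might look roughly like this. The sketch is in Python for illustration and assumes one JSON file per job; the file layout and the names `next_job` and `finish_job` are assumptions of this example, not anything prescribed above.

```python
import json
import os

def next_job(drop_dir):
    """Return (path, job) for the oldest job file in drop_dir, or None if empty.

    Oldest is decided by modification time, with the file name as tie-breaker.
    """
    entries = [e for e in os.scandir(drop_dir)
               if e.is_file() and e.name.endswith(".json")]
    if not entries:
        return None
    oldest = min(entries, key=lambda e: (e.stat().st_mtime, e.name))
    with open(oldest.path) as fh:
        return oldest.path, json.load(fh)

def finish_job(path):
    """Remove the job file once the job has been handed to the test runner."""
    os.remove(path)
```

Submitters would do well to write the file under a temporary name and rename it into place, so the manager never reads a half-written job description.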

    Sorry that I am not aware of a CPAN module that would do all of this - but that doesn't mean that such a thing doesn't exist! Maybe there is some way to combine your simple resource control's "enough resources now, y/n?" check with an existing module. I presume that would have the effect of running jobs that require fewer resources at a "higher priority" than ones that require more? Anyway, I recommend starting simple and measuring how well it works.

    "Reserving" some of the resources in advance without being able to acquire all the resources at the same time can lead to "deadlocks". Sorry if I wasn't more help. The general problem of maximally efficient use of resources is difficult (at least for me). But I am hoping that something simple will "move the ball forward" and perhaps even allow developers to inject other tests into the nightly run's mix of regression tests (software folks are known "night owls").
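One way to sidestep the deadlock risk mentioned above is to take all of a job's locks or none of them. A sketch, again in Python on a Unix-like system, with hypothetical helper names `lock_all` and `unlock_all`:

```python
import fcntl
import os

def lock_all(paths):
    """Lock every file in paths, or none of them.

    Returns a list of open descriptors on success. If any lock is
    already held elsewhere, releases whatever was taken and returns None.
    """
    held = []
    for path in sorted(paths):  # fixed order keeps attempts predictable
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            os.close(fd)
            for taken in held:
                os.close(taken)  # closing the descriptor drops its lock
            return None
        held.append(fd)
    return held

def unlock_all(fds):
    """Release every lock taken by lock_all."""
    for fd in fds:
        os.close(fd)
```

Since a job never sits on a partial set of resources while waiting for the rest, two jobs cannot end up each holding what the other needs.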

      Thanks, this was helpful. I'm thinking about writing this myself, but there are a lot of corner cases to take care of (what if a resource goes down / is reserved by someone else? What if a job never finishes? What if I need to scale this up and support multiple "job submit nodes"?) That's why I was hoping for an existing solution.

      There are a lot of Grid schedulers (Condor, Sun Grid Engine forks, Torque, etc.), but the problem I see with most of them is that they operate under the assumption that they control the actual jobs and start / stop the individual processes. That's not the case in our environment where we already have the "control job" (basically a customized version of the TAP::Parser module).

      It sounds like the perfect use-case for Tapper.

      There we have a scheduler that maintains HOSTS and QUEUES. A queue usually corresponds to a test use-case (like "linux-stable", "linux-rc", etc.). You put test requests into a queue, including a "requested host features" spec, and let the scheduler decide which queue to serve next based on queue bandwidths and available hosts. Test requests can re-queue themselves to create a continuous rotation of the use-cases.

      Setting up Tapper with all features (as used in the OSRC, where we set up machines from scratch with other distributions and Xen/KVM setups) can be a bit tricky, but you seem to be OK with using ssh.

      See http://renormalist.net/misc/ for public material about it.

      Tell me if you already found another solution. Else I could help you set up a Tapper instance step by step.

Re: Looking for a resource management / job queue module
by Anonymous Monk on Jul 21, 2012 at 16:55 UTC
    This is a finite-domain problem. Look at GNU Prolog...

Node Type: perlquestion [id://982905]
Approved by davies
Front-paged by Corion