Now let's say the 1st command finishes and updates the database, incrementing its session counter.
(This is just me thinking how I would avoid your undefined jobnode issue. If you're happy with your current solution, stick with it.:)
Your current mechanism uses a single queue for all resources, and has to constantly poll each of the pending jobs and then check the required resource to see if it is available. This puts a very expensive 'busy loop' at the heart of your scheduler.
If you have a lot of jobs queued for popular (or slow, or both) resources at the head of your queue, you are going to be constantly polling over those pending jobs and querying the DB for the resource status, just to get to the newer pending requests for less popular, higher-throughput resources. In other words, the slower, more popular resources become a bottleneck in front of all your other faster, more transient ones.
That smacks of a scaling issue being designed into the very heart of your processing.
I would tackle the queuing for resources in a quite different way:
- Jobs get queued to resource specific queues.
Implementation: a hash of queues keyed by resource name or id.
- When a job is enqueued, it is inspected and is added to the appropriate queue.
- You have another 'jobDone' queue. When a job is completed, its jobid/resourceid pair is queued to this queue.
And the heart of your scheduler is a thread reading (dequeuing, not polling) that jobDone queue.
As it pulls each jobid/resourceid off that queue, it:
- Updates the job status in %nodes.
- Updates the DB resource status (if still necessary).
- Takes the next pending job off the queue associated with the now freed resource and puts it on the work queue.
- Goes back to dequeue the next done job; blocking if there's nothing there.
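The steps above can be sketched roughly like this. (A minimal sketch, not your code: the names `%resource_queues`, `$done_q`, `enqueue_job` and `dispatcher` are mine, the resource names are made up, and the DB/worker details are elided.)

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# One queue per resource (hypothetical resource names), plus the jobDone queue.
my %resource_queues;
$resource_queues{$_} = Thread::Queue->new for qw( db1 db2 web );
my $done_q = Thread::Queue->new;

# Enqueue: inspect the job and route it to the appropriate resource queue.
sub enqueue_job {
    my ( $jobid, $resourceid ) = @_;
    $resource_queues{$resourceid}->enqueue($jobid);
}

# The heart of the scheduler: block (no polling) on the jobDone queue.
sub dispatcher {
    while ( my $msg = $done_q->dequeue ) {    # blocks when nothing is there
        my ( $jobid, $resourceid ) = split /:/, $msg;
        # ... update the job status in %nodes; update the DB if still needed ...
        # Take the next job pending for the now-freed resource, if any.
        if ( defined( my $next = $resource_queues{$resourceid}->dequeue_nb ) ) {
            # ... hand $next to a worker for $resourceid ...
        }
    }
}
```

No loop ever scans jobs it cannot run: a backlog on one resource's queue never delays dispatch for any other resource.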
My understanding was that when you enqueue something into the Thread::Queue, you get a shared_clone of that object and thus cannot make direct modifications to the original. The hash is simply the means for returning information to the jobNode in the originating thread.
That is true, and I can see how this is affecting your design decisions. But not for the good.
Because sending an object (node) via a queue gives you a copy, you now require a second queue to send the (modified) node back to somewhere else so that it can read the modified copy's information and update the original. This gets complicated and expensive. And it is completely unnecessary!
You have your shared %nodes. Every thread can access that structure directly. So don't queue nodes; queue node ids.
When the receiver dequeues, instead of getting an unshared clone of the node, it just gets an ID string, that it uses to directly access the shared %nodes, to read information and update directly.
Now you have no need for the return queue (nor anything to read it!). All your queue messages become lighter and faster; and there is one central, definitive, always up-to-date copy of the complex, structured information.
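A minimal sketch of the idea, assuming a shared %nodes as you describe (the node fields and the id `job42` are invented for illustration):

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

# The one central, definitive, always up-to-date copy of the node data.
my %nodes : shared;
$nodes{job42} = shared_clone( { status => 'pending', resource => 'db1' } );

my $work_q = Thread::Queue->new;

# Sender: queue only the lightweight ID string, never the node itself.
$work_q->enqueue('job42');

# Receiver: the dequeued ID is just a string; use it to reach the shared
# node directly and update the original in place -- no return queue needed.
my $id = $work_q->dequeue;
$nodes{$id}{status} = 'running';
```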
This next bit is speculative and would require careful thought and analysis; but if your architecture lends itself, you may even avoid the need to do locking on %nodes.
If you can arrange that each jobid/resourceid pair token (a string) is, and can only be, created once, then as it gets passed around, only one thread at a time can ever hold it, so there is no need to lock.
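Very roughly, and only as an illustration of the single-creation idea (the token format and sub name are invented):

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $work_q = Thread::Queue->new;

# The token is minted in exactly one place. From then on it only ever
# moves between queues, so at any instant exactly one thread holds it,
# and no locking is needed to protect it.
sub issue_token {
    my ( $jobid, $resourceid ) = @_;
    my $token = join ':', $jobid, $resourceid;    # the only creation point
    $work_q->enqueue($token);
    return;
}

issue_token( 42, 'db1' );
my $token = $work_q->dequeue_nb;    # this thread now solely "owns" job 42
```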
I can almost feel your frustration as you read this and are thinking: "But I'd have to re-write the whole damn thing"! If it doesn't work for you, just don't do it! You don't even have to tell me :)
But consider the ideas, because done right, you'd have a light, fast, self-synchronising, scalable process with no bottlenecks and no cpu-sapping busy-loops.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Seriously awesome insights, and you've touched on one of the design decisions we made a while back.
Jobs get queued to resource specific queues.
We discussed this at great length, and decided on our current approach for one main reason (that I don't think I really mentioned before): when a job is first issued, it does not necessarily have to be set for a specific resource; it could have a list of resources that are an adequate match. This being the case, we could still definitely make it work, but we just weren't entirely happy with some of the caveats. We saw two main approaches to tackling this:
- Put the job in all of the individual queues that match and find some way to prevent it from getting run twice (which seemed like it would take a fair bit of work)
- Or put it in the shortest queue. This, however, assumes that the shortest queue would lead to faster execution, which, depending on the job, may not be the case
With those points in mind, do you still think multiple resource-specific queues would be the best approach? Legitimately curious. I do, however, see your point about there being a bottleneck in our queue, but with the current implementation, I don't really think it's quite as bad as you say.
If you have a lot of jobs queued for popular (or slow; or both) resources at the head of your queue; you are going to be constantly polling over those pending jobs and querying the DB for the resource status;
Because again, we only query the database one time per pass over the array, not for each individual element. After we do that and build our hash of the available resources, wouldn't checking the jobs be relatively quick? I'm not saying this to cling to my current implementation, but I just don't see how querying the database once, building a hash, and checking each node in the single queue would be drastically slower than dequeuing constantly and checking the database every time.
(This is just me thinking how I would avoid your undefined jobnode issue. If you're happy with your current solution, stick with it.:)
Yeah, I appreciate the advice. I wouldn't say I'm thrilled with it, but I think it would take a decent bit of time to rework this aspect, and at the moment I'm just not seeing the benefits justifying it. (Although the code would be much prettier :( )
You had me at:
So don't queue nodes; queue node ids.
Genius. Again, I'm ashamed I did not think of this. It shouldn't be terribly difficult to implement either.
This next bit is speculative and would require careful thought and analysis; but if your architecture lends itself, you may even avoid the need to do locking on %nodes.
Definitely an intriguing idea. I'll keep it in mind, but I think I'll have to get things working without it before really investigating this aspect :P
I can almost feel your frustration as you read this and are thinking: "But I'd have to re-write the whole damn thing"! If it doesn't work for you, just don't do it! You don't even have to tell me :)
Quite the contrary: I am a fresh CS graduate and do not work with any programming (let alone Perl) experts. I'm happy for the suggestions and advice. It's the only way to learn.
With those points in mind, do you still think multiple resource specific queues would be the best approach?
Yes. I'd tackle the 'which resource queue for non-specific jobs' problem using your option 1: queue it (the jobid) to all applicable queues. To deal with the run-twice problem, I'd have a status field in the object that the first resource worker to dequeue it changes from 'pending' to 'running'. Any subsequent worker dequeuing it sees 'running' and simply discards it and grabs the next one.
Multiple tokens mean locking is required: the workers need to lock the status before reading it, to prevent two resources seeing 'pending' simultaneously.
But if the node hashes (within %nodes) are also shared, then you only need lock the specified node, not the entire %nodes, so the lock time will be brief and contention low.
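A rough sketch of that claim-or-discard check, assuming shared per-node hashes inside %nodes (the sub name `claim_job` and the id `job7` are mine):

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my %nodes : shared;
$nodes{job7} = shared_clone( { status => 'pending' } );

# A worker that has just dequeued $id from one of the resource queues
# tries to claim the job; only the first worker to see 'pending' wins.
sub claim_job {
    my ($id) = @_;
    my $node = $nodes{$id};
    lock %$node;    # locks just this node's hash, not the whole %nodes
    return 0 if $node->{status} ne 'pending';    # already claimed: discard
    $node->{status} = 'running';
    return 1;
}
```

Because the lock covers a single node, the hold time is brief and contention between workers for unrelated jobs is nil.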
wouldn't checking the jobs be relatively quick? ... I just dont see how ... than dequeueing constantly
The problem is not (necessarily or only) that it would be slower. It is that busy loops sap CPU, constantly redoing the same work over and over until something changes to allow them to move forward. (You've already seen this effect yourself, as identified by the comment in your cut-down code:
#sleep 1; #slow it down, not useful for testing though
With my alternative, the central dispatcher thread sits blocked (consuming no CPU at all) in a dequeue until a new job arrives, and wakes instantly (subject to getting a timeslice, usually a few microseconds) when something does. Then all it has to do is re-queue the token to the appropriate resource queue and go back to blocking.
If the resource is available, its worker wakes up instantly to process the job. And if not, it will grab it as soon as it is done with the preceding jobs.
Conversely, with your single queue-cum-array affair, when a resource comes free it has to wait for you to notice (query the DB), build your hash, then scan through all the pending jobs (sod's law says it will always be the last one), and then re-queue it.
If one of your popular resources gets a long job (or hangs, or crashes) and you build up a backlog for that resource, then jobs for every other resource, even idle ones, have to wait while you scan over that backlog every time.
Think of it like this (sad analogy time :). Would you rather have a receptionist in your local hospital who:
- just asked which dept. you needed and directed you to its waiting room;
- or one who made you stand in a single line at her desk with everyone else for every dept., while she rang around those departments asking if they were free for a client, and then walked down the line asking each person in turn whether they needed one of the departments that said yes?
Blocking is always better than polling. It's why we have whistles on kettles and bells on phones and front doors. If you had to keep picking up the receiver to see if anyone was calling, phones would never have caught on :)
Definitely an intriguing idea. I'll keep this in mind, ...
If you went for the reQ to multiple Qs idea above, the no locking idea is a non-starter.
A bit more grist for your mill :) Good luck.