Do you anticipate needing to run anything in parallel? If so, MPI / PVM might be worth investigating. If not, things can be simpler.
You may only need to divide up large jobs into small pieces, send them off to other systems (i.e. cluster or grid nodes) to run, and then collect the results. This could result in a system with parts for:
- Taking a large job and dividing it into N small jobs
- Keeping track of job status in a central (or distributed) system. This could be database tables and/or a central process.
- Queuing jobs for submission to nodes and taking appropriate action when they are completed. If results are stored centrally in a database, this might mean only monitoring node status and tracking errors as well as completion of jobs.
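To make the three pieces above a bit more concrete, here is a rough single-machine sketch in Python. Everything in it is an assumption for illustration: threads stand in for cluster nodes, a plain dict stands in for the status tables, and the names split_job, node_worker and run_job are made up. A real system would replace the in-memory queue with whatever submits work to remote machines.

```python
import queue
import threading

def split_job(items, n):
    """Divide one large job (a list of work items) into at most n small jobs."""
    size = max(1, -(-len(items) // n))  # ceiling division, never zero
    return [items[i:i + size] for i in range(0, len(items), size)]

def node_worker(jobs, status, results, lock):
    """Stand-in for a node: pull small jobs until the queue is empty."""
    while True:
        try:
            job_id, chunk = jobs.get_nowait()
        except queue.Empty:
            return
        try:
            value = sum(x * x for x in chunk)  # placeholder workload
            with lock:
                results[job_id] = value
                status[job_id] = "done"
        except Exception as exc:
            # Record the error so the central tracker can act on it.
            with lock:
                status[job_id] = f"failed: {exc}"

def run_job(items, n_jobs=4, n_nodes=2):
    """Split the job, queue the pieces, run them, and return status and results."""
    jobs = queue.Queue()
    status, results, lock = {}, {}, threading.Lock()
    for job_id, chunk in enumerate(split_job(items, n_jobs)):
        status[job_id] = "queued"
        jobs.put((job_id, chunk))
    nodes = [
        threading.Thread(target=node_worker, args=(jobs, status, results, lock))
        for _ in range(n_nodes)
    ]
    for node in nodes:
        node.start()
    for node in nodes:
        node.join()
    return status, results

if __name__ == "__main__":
    status, results = run_job(list(range(10)))
    print(status)
    print(sum(results.values()))
```

The status dict is where a database table would go in practice, and the "failed" branch is where retry or resubmission logic would hook in.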
It's not impossible for one person, but the complexity can be high and debugging distributed systems can be tricky.
A commercial product called LSF (www.platform.com) does much of this and works well with cluster applications. With it, one may create queues for specific systems or sets of systems, submit jobs, and monitor them.
For grid versus cluster applications, all this gets fuzzier. Are the grid systems shared? How reliable are they? How much redundancy is needed?
Like I said, these are some thoughts and questions. Hope it's useful. :)