This page is part of a frozen web archive of this mailing list; no new mails have been added to it since July 2016.
On 30 August 2011 02:55, Ralph Castain <rhc_at_[hidden]> wrote:
> Instead, all used dynamic requests - i.e., the job that was doing a comm_spawn would request resources at the time of the comm_spawn call. I would pass the request to Torque, and if resources were available, immediately process them into OMPI and spawn the new job. If resources weren't available, I simply returned an error to the program so it could either (a) terminate, or (b) wait awhile and try again. One of the groups got ambitious and supported non-blocking requests (generated a callback to me with resources when they became available). Worked pretty well - might work even better once we get non-blocking MPI_Comm_spawn.
> I believe they generally were happy with the results, though I think some of them wound up having Torque "hold" a global pool of resources to satisfy such requests, just to avoid blocking progress on the job while waiting for comm_spawn resources.
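The request-at-spawn-time protocol Ralph describes — ask the resource manager at comm_spawn time, proceed if resources are granted, otherwise return an error so the job can terminate or back off and retry — can be sketched abstractly. This is a minimal simulation of that pattern: the resource-manager class and function names are hypothetical stand-ins for Torque and MPI_Comm_spawn, not real APIs.

```python
import time

# Hypothetical stand-in for the resource manager (e.g. Torque); the names
# and behaviour here are assumptions for illustration only.
class FakeResourceManager:
    def __init__(self, free_slots):
        self.free_slots = free_slots

    def request(self, n):
        """Grant n slots if available, else signal failure
        (cf. the error returned to the program in the mail above)."""
        if self.free_slots >= n:
            self.free_slots -= n
            return True
        return False

def spawn_with_retry(rm, n, attempts=3, backoff=0.01):
    """Option (b) from the mail: on failure, wait a while and try again."""
    for _ in range(attempts):
        if rm.request(n):
            return True       # resources granted: proceed to spawn the job
        time.sleep(backoff)   # resources busy: back off before retrying
    return False              # give up, as in option (a): report the error

rm = FakeResourceManager(free_slots=4)
print(spawn_with_retry(rm, 2))  # True: 2 of 4 slots granted
print(spawn_with_retry(rm, 4))  # False: only 2 slots remain
```

The non-blocking variant mentioned above would instead register a callback to be invoked once the resources become available, rather than polling in a loop.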
Quite often on a larger cluster there are several jobs running
simultaneously, and you configure the batch scheduler to select
groups of nodes that are physically close to each other, since you get
a bit more performance that way.
However, if (say) a node is down for maintenance it can knock these
patterns out. Could we foresee MPI jobs dynamically moving back to a
node once it is returned to service?