I have implemented the fault tolerance method that uses
MPI_COMM_SPAWN to dynamically create communication groups and then
uses those communicators for a form of process fault tolerance (as
described by William Gropp and Ewing Lusk in their 2004 paper),
but I am having some problems getting it to work the way I intended.
Basically, when it runs, it spawns all the processes on the
same machine (each spawn always starts at the top of the
machine_list). Is there a way that I can get these
processes to spawn on different machines?
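
For illustration only, here is a minimal sketch of one approach that might apply: choosing a target machine per spawn in round-robin order and passing it through the "host" info key, which the MPI standard reserves for MPI_COMM_SPAWN. How that key is interpreted depends on the MPI implementation, so this is an assumption rather than a confirmed fix. The helper pick_host, the host names, and the "./worker" command are hypothetical names introduced for the example; the MPI calls appear only in a comment because they need <mpi.h> and a running MPI job.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of MPI): choose the host for the
 * spawn_index-th spawned process by cycling through a machine list,
 * so consecutive spawns do not all land on the first entry. */
const char *pick_host(const char *const hosts[], size_t n_hosts,
                      size_t spawn_index)
{
    assert(hosts != NULL && n_hosts > 0);
    return hosts[spawn_index % n_hosts];
}

/* Sketch of how the result might be used (requires <mpi.h>; assumes the
 * implementation honors the reserved "host" info key):
 *
 *   MPI_Info info;
 *   MPI_Comm child;
 *   MPI_Info_create(&info);
 *   MPI_Info_set(info, "host", pick_host(hosts, n_hosts, k));
 *   MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, info, 0,
 *                  MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
 *   MPI_Info_free(&info);
 */
```

With a three-entry list, spawn indices 0, 1, 2, 3 would map to the first, second, third, and then the first entry again.
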
One possible route I considered was using something like SLURM to
distribute the jobs, and just putting '+' in the machine file. Will
this work? Is this the best route to go?
Thanks for any help with this.