I have implemented the fault-tolerance approach in which
MPI_COMM_SPAWN is used to dynamically create communication groups,
and those communicators provide a form of process fault tolerance (as
described by William Gropp and Ewing Lusk in their 2004 paper),
but I am having some problems getting it to work the way I intended.
Basically, when it runs, it spawns all the processes on the
same machine (it always starts at the top of the machine_list
when spawning a process). Is there a way that I can get these
processes to spawn on different machines?
One possible route I considered was using something like SLURM to
distribute the jobs, and just putting '+' in the machine file. Will
this work? Is this the best route to go?
Thanks for any help with this.