Open MPI User's Mailing List Archives

From: bdickinson_at_[hidden]
Date: 2006-07-28 10:15:19


I have implemented the fault tolerance method that uses
MPI_COMM_SPAWN to dynamically create communication groups and then
uses those communicators for a form of process fault tolerance (as
described by William Gropp and Ewing Lusk in their 2004 paper),
but I am having some problems getting it to work the way I intended.
Basically, when it runs, it spawns all the processes on the same
machine (as it always starts at the top of the machine_list when
spawning a process). Is there a way that I can get these processes
to spawn on different machines?

One possible route I considered was using something like SLURM to
distribute the jobs, and just putting '+' in the machine file. Will
this work? Is this the best route to go?

Thanks for any help with this.

Byron