Open MPI User's Mailing List Archives

From: Edgar Gabriel (gabriel_at_[hidden])
Date: 2006-07-28 10:53:40

Don't forget, furthermore, that for this fault-tolerance approach to
succeed, the parent and the other child processes must not be affected
by the death/failure of one child process. Right now in Open MPI, if
one of the child processes (which you spawned using MPI_Comm_spawn)
fails, the whole application fails. [To be more precise: the MPI
standard does not mandate the behavior described in the paper you
mentioned.]
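
For this approach even to be attempted, the parent must request error
codes on the intercommunicator returned by MPI_Comm_spawn instead of
the default MPI_ERRORS_ARE_FATAL. A minimal sketch of that requirement
follows; the "worker" program name is a placeholder, and, as noted
above, current Open MPI will still abort the whole job on a child
failure:

    /* sketch only -- shows what the fault-tolerance approach assumes,
     * not what current Open MPI actually does on a child failure */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm children;
        int errcodes[2], token = 42, rc;

        MPI_Init(&argc, &argv);

        /* "worker" is a hypothetical child executable */
        MPI_Comm_spawn("worker", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &children, errcodes);

        /* Replace the default MPI_ERRORS_ARE_FATAL so a failed child
         * surfaces as an error code instead of killing the parent. */
        MPI_Comm_set_errhandler(children, MPI_ERRORS_RETURN);

        rc = MPI_Send(&token, 1, MPI_INT, 0, 0, children);
        if (rc != MPI_SUCCESS) {
            /* With an implementation that survives child failure,
             * recovery (e.g. respawning the worker) would go here. */
            fprintf(stderr, "child unreachable, rc=%d\n", rc);
        }

        MPI_Finalize();
        return 0;
    }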


Josh Hursey wrote:
>> I have implemented the fault tolerance method in which you would use
>> MPI_COMM_SPAWN to dynamically create communication groups and use
>> those communicators for a form of process fault tolerance (as
>> described by William Gropp and Ewing Lusk in their 2004 paper),
>> but am having some problems getting it to work the way I intended.
>> Basically, when it runs, it is spawning all the processes on the
>> same machine (as it always starts at the top of the machine_list
>> when spawning a process). Is there a way that I can get these
>> processes to spawn on different machines?
> In Open MPI (and most other MPI implementations) you will be restricted to
> using only the machines in your allocation when you use MPI_Comm_spawn*.
> The standard allows you to suggest to MPI_Comm_spawn where to place
> the 'children' it creates, using an MPI_Info key -- specifically the
> {host} key referenced here:
> MPI_Info is described here:
> Open MPI, in the current release, does not do anything with this key.
> This has been fixed in subversion (as of r11039) and will be in the next
> release of Open MPI.
> If you want to use this functionality in the near term I would suggest
> using the nightly build of the subversion trunk available here:
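
As a concrete illustration of the {host} hint Josh describes (honored
from r11039 onward), a minimal sketch might look like the following;
the host name "node02" and the program "worker" are placeholders, and
the key is only a suggestion to the implementation:

    /* sketch only -- place one spawned child on a suggested host */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm children;
        MPI_Info info;
        int errcodes[1];

        MPI_Init(&argc, &argv);

        MPI_Info_create(&info);
        /* hint the target host; releases before r11039 ignore it */
        MPI_Info_set(info, "host", "node02");

        MPI_Comm_spawn("worker", MPI_ARGV_NULL, 1, info,
                       0, MPI_COMM_WORLD, &children, errcodes);

        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }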
>> One possible route I considered was using something like SLURM to
>> distribute the jobs, and just putting '+' in the machine file. Will
>> this work? Is this the best route to go?
> Off the top of my head, I'm not sure whether that would work. The
> best/cleanest route would be to use an MPI_Info object with the {host}
> key.
> Let us know if you have any trouble with MPI_Comm_spawn or MPI_Info in
> this scenario.
> Hope that helps,
> Josh
>> Thanks for any help with this.
>> Byron