Open MPI User's Mailing List Archives

From: Edgar Gabriel (gabriel_at_[hidden])
Date: 2006-07-28 10:53:40


Also, don't forget that for this fault-tolerance approach to work, the
parent and the other child processes must not be affected by the
death/failure of a child process. Right now in Open MPI, if one of the
child processes (which you spawned using MPI_Comm_spawn) fails, the
whole application will fail. [To be more precise: the MPI standard does
not mandate the behavior described in the paper you mentioned.]

Thanks
Edgar

Josh Hursey wrote:
>> I have implemented the fault tolerance method in which you would use
>> MPI_COMM_SPAWN to dynamically create communication groups and use
>> those communicators for a form of process fault tolerance (as
>> described by William Gropp and Ewing Lusk in their 2004 paper),
>> but am having some problems getting it to work the way I intended.
>> Basically, when it runs, it is spawning all the processes on the
>> same machine (as it always starts at the top of the machine_list
>> when spawning a process). Is there a way that I can get these
>> processes to spawn on different machines?
>>
>
> In Open MPI (and most other MPI implementations) you will be restricted to
> using only the machines in your allocation when you use MPI_Comm_spawn*.
> The standard allows you to suggest to MPI_Comm_spawn where to place the
> 'children' that it creates using an MPI_Info key -- specifically the
> {host} key referenced here:
> http://www.mpi-forum.org/docs/mpi-20-html/node97.htm#Node97
> MPI_Info is described here:
> http://www.mpi-forum.org/docs/mpi-20-html/node53.htm#Node53
>
> Open MPI, in the current release, does not do anything with this key.
> This has been fixed in subversion (as of r11039) and will be in the next
> release of Open MPI.
>
> If you want to use this functionality in the near term I would suggest
> using the nightly build of the subversion trunk available here:
> http://www.open-mpi.org/nightly/trunk/
>
>
>> One possible route I considered was using something like SLURM to
>> distribute the jobs, and just putting '+' in the machine file. Will
>> this work? Is this the best route to go?
>
> Off the top of my head, I'm not sure if that would work or not. The
> best/cleanest route would be to use an MPI_Info object with the {host}
> key.
>
> Let us know if you have any trouble with MPI_Comm_spawn or MPI_Info in
> this scenario.
>
> Hope that helps,
> Josh
>
>> Thanks for any help with this.
>>
>> Byron
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
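
The {host} hint suggested in Josh's reply could be passed roughly as
follows. This is a minimal sketch, not code from the thread: next_host
and spawn_on_host are hypothetical helper names, and it assumes the
children should be spread round-robin over a host list the parent
already knows. The MPI part is compiled only when USE_MPI is defined
(for example with mpicc -DUSE_MPI), so the host-selection logic can be
checked without an MPI installation. Whether the hint is honored
depends on the MPI implementation and release.

```c
#include <stddef.h>

/* Hypothetical helper: pick a host for the i-th spawn by cycling through
 * the host list, so successive children do not all land on the first
 * entry. Returns NULL if the list is missing or empty. */
const char *next_host(const char *const hosts[], size_t nhosts,
                      size_t spawn_index)
{
    if (hosts == NULL || nhosts == 0)
        return NULL;
    return hosts[spawn_index % nhosts];
}

#ifdef USE_MPI
#include <mpi.h>

/* Hypothetical helper: spawn one copy of `command` and suggest `host`
 * through the reserved "host" info key. The child is reached through
 * its own intercommunicator, returned in *intercomm.
 *
 * Returns the result of MPI_Comm_spawn. Note that with the default
 * MPI_ERRORS_ARE_FATAL error handler a failed spawn aborts the program
 * instead of returning an error code. */
int spawn_on_host(const char *command, const char *host,
                  MPI_Comm *intercomm)
{
    MPI_Info info;
    int errcode; /* one entry, because maxprocs is 1 */
    int rc;

    MPI_Info_create(&info);
    MPI_Info_set(info, "host", host);

    rc = MPI_Comm_spawn(command, MPI_ARGV_NULL, 1, info,
                        0, MPI_COMM_SELF, intercomm, &errcode);

    MPI_Info_free(&info);
    return rc;
}
#endif /* USE_MPI */
```

A parent might then call
spawn_on_host("./worker", next_host(hosts, nhosts, i), &children[i])
once per child, keeping a separate intercommunicator for each one. The
"./worker" executable and the contents of hosts are placeholders.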