Very good document.
About the MPI layer (in case of a fault), my idea is to give the BML the
ability to handle the BTL errors that occur when a process dies (and has
probably been migrated) by discovering its new location. I think this is
possible because the HNP requests the restart from the orted daemon,
so it knows the new location of the failed process.
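
A minimal sketch of that idea, written as a toy Python model rather than
Open MPI code: a send wrapper catches a transport error, asks a registry
(standing in for the HNP) for the peer's new location, and retries once.
Every name here (PeerUnreachable, LocationRegistry, send_with_relocation,
the node names) is hypothetical and does not correspond to any real BML,
BTL, or ORTE interface.

```python
class PeerUnreachable(Exception):
    """Stands in for a BTL-level error raised when the peer's old endpoint is dead."""


class LocationRegistry:
    """Toy stand-in for the HNP's knowledge of where each rank currently runs."""

    def __init__(self, locations):
        self._locations = dict(locations)

    def migrate(self, rank, new_node):
        self._locations[rank] = new_node

    def lookup(self, rank):
        return self._locations[rank]


def make_transport(live_locations):
    """Return a send function that fails unless the rank really runs on the given node."""
    def transport_send(rank, node, payload):
        if live_locations.lookup(rank) != node:
            raise PeerUnreachable(f"rank {rank} is not on {node}")
        return f"delivered {payload!r} to rank {rank} on {node}"
    return transport_send


def send_with_relocation(rank, payload, cache, registry, transport_send):
    """Send via the cached endpoint; on failure, refresh it from the registry and retry once."""
    try:
        return transport_send(rank, cache[rank], payload)
    except PeerUnreachable:
        # Assumption: the registry already knows where the rank was restarted.
        cache[rank] = registry.lookup(rank)
        return transport_send(rank, cache[rank], payload)


registry = LocationRegistry({0: "node-a", 1: "node-b"})
cache = {0: "node-a", 1: "node-b"}   # endpoints learned at startup (the modex result)
send = make_transport(registry)

registry.migrate(1, "node-c")        # rank 1 died and was restarted elsewhere
result = send_with_relocation(1, "hello", cache, registry, send)
print(result)
```

Only the sender that actually hits the error refreshes its cached endpoint
in this sketch; the other processes keep their stale entry until they try
to talk to the migrated rank themselves.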
Ralph Castain wrote:
> If you look at the Dec meeting wiki, you will see that we are moving
> quickly to a modex-less launch anyway. It won't be the default because
> it requires pre-discovery of the cluster's network resources (for
> which we will provide a tool or method), but it will help resolve some
> of these problems.
> Outside of that, I will have to leave it to the FT folks to figure out
> how to resolve modex situations. We have the ability to support
> multiple modex models (and already do), but I don't know if you can do
> what you describe or not - I'm not sure how the MPI layer will handle
> that situation.
> On Nov 13, 2008, at 6:22 AM, Leonardo Fialho wrote:
>> I agree with your viewpoint, mainly about the "reachability".
>> From the FT viewpoint, some FT architectures sometimes need to
>> recover an application process on a node different from the
>> original one. In this case a new modex must be performed. That is fine
>> for coordinated C/R; for uncoordinated C/R, on the other hand, it is
>> not a good choice, I think. Once again, the tradeoffs...
>> A possible solution is to perform n-1 modex exchanges, each involving
>> the recovered process and one of the other processes... Is that better
>> than an allgather modex? I don't know. I think not. And what is the
>> impact of an allgather modex while the MPI thread is delivering
>> messages? The answers to these questions could suggest that
>> uncoordinated C/R is not possible on Open MPI.
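
To put rough numbers on the n-1 pairwise exchanges versus an allgather,
here is a small Python sketch that only counts point-to-point messages.
It assumes textbook ring and recursive-doubling allgather algorithms, which
are not necessarily what the Open MPI modex uses, and the function names are
made up for illustration. Message counts alone do not settle the question,
since synchronization and interference with in-flight MPI traffic are not
modeled.

```python
def pairwise_messages(n):
    """Recovered process exchanges contact info with each of the other n-1 processes."""
    return 2 * (n - 1)          # one message in each direction per peer


def ring_allgather_messages(n):
    """Ring allgather: n-1 steps, and every process sends one message per step."""
    return n * (n - 1)


def recursive_doubling_messages(n):
    """Recursive-doubling allgather for power-of-two n: log2(n) steps, one send per process per step."""
    steps = n.bit_length() - 1
    if n < 1 or 2 ** steps != n:
        raise ValueError("this formula assumes n is a power of two")
    return n * steps


for n in (8, 64, 1024):
    print(n, pairwise_messages(n), ring_allgather_messages(n), recursive_doubling_messages(n))
```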
>> Leonardo Fialho
>> Jeff Squyres wrote:
>>> On Nov 7, 2008, at 10:18 AM, Leonardo Fialho wrote:
>>>> I understand that a process needs the contact information to
>>>> send MPI messages to other processes, and the modex provides it. My
>>>> question is: why not perform the contact exchange only when it is
>>>> needed? For example, in an M/W application, the workers do not need
>>>> more information than the master's contact info.
>>>> I think that it reduces the startup time, but increases the cost of
>>>> the *first* communication between two peers.
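
The tradeoff in the question above can be illustrated with a toy Python
model of on-demand contact exchange in a master/worker job. The
LazyContacts class and the directory it queries are hypothetical; nothing
here reflects how Open MPI stores or exchanges modex data.

```python
class LazyContacts:
    """Per-process contact table that is filled on first use instead of at startup."""

    def __init__(self, directory):
        self._directory = directory   # hypothetical global lookup service
        self._known = {}
        self.lookups = 0

    def endpoint(self, peer):
        if peer not in self._known:   # only the first contact pays the lookup cost
            self._known[peer] = self._directory[peer]
            self.lookups += 1
        return self._known[peer]


num_workers = 4
# Rank 0 is the master; ranks 1..num_workers are the workers.
directory = {rank: f"host{rank}:5000" for rank in range(num_workers + 1)}

workers = {rank: LazyContacts(directory) for rank in range(1, num_workers + 1)}
for _ in range(3):                    # each worker talks to the master three times
    for table in workers.values():
        table.endpoint(0)

lazy_total = sum(table.lookups for table in workers.values())
# With an up-front exchange, each worker would learn all num_workers other processes.
eager_total = num_workers * num_workers
print(lazy_total, eager_total)
```

In this model each worker performs a single lookup, on its first message to
the master, whereas an up-front exchange hands every worker the contact
info of every other process whether or not it is ever used.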
>>> FWIW, this is actually a pretty complex topic. There are many, many
>>> tradeoffs in terms of what performance do you want vs. what
>>> functionality do you want. This subject has been discussed for
>>> many, many hours by the OMPI developers. :-)
>>> The modex is performed during MPI_INIT; the v1.3 series' modex is
>>> quite a bit more efficient than the v1.2 series' modex. The modex
>>> information comprises several things, some of which are either
>>> the contact info or "reachability" info of BTL modules. For the
>>> openib BTL, for example, port subnet IDs and MTUs are passed in
>>> the modex, but LIDs don't need to be passed (in some cases) until
>>> two processes actually try to reach each other. We use the
>>> reachability information to determine whether a given BTL module
>>> *could* be used to connect to a remote peer. For example, if we get
>>> to the end of MPI_INIT and find a peer that cannot be reached, we
>>> abort (after hours of debate, we decided it was better to abort
>>> right away when there was a peer that could not be reached rather
>>> than abort only on attempted first contact because it could be a
>>> simple network/configuration error that should be detected
>>> immediately, rather than erroring out [potentially] long into a
>>> multi-hour run).
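
The reachability check described above can be sketched in Python as
follows. This is a toy model under loud assumptions: a module is treated as
able to reach a peer when they share a subnet ID, and the function names,
module names, and subnet strings are all invented for illustration rather
than taken from Open MPI.

```python
def reachable_modules(local_modules, peer_subnets):
    """Names of local modules that *could* reach the peer (same subnet ID in this toy)."""
    return [m["name"] for m in local_modules if m["subnet"] in peer_subnets]


def check_reachability(local_modules, modex):
    """Run at the end of init: fail immediately if any peer has no usable module."""
    plan = {rank: reachable_modules(local_modules, subnets)
            for rank, subnets in modex.items()}
    unreachable = sorted(rank for rank, mods in plan.items() if not mods)
    if unreachable:
        raise RuntimeError(f"cannot reach peers {unreachable}; aborting at end of init")
    return plan


local_modules = [{"name": "ib0", "subnet": "fe80:a"},
                 {"name": "ib1", "subnet": "fe80:b"}]

# Subnet IDs each peer published in the (toy) modex.
modex = {1: {"fe80:a"}, 2: {"fe80:a", "fe80:b"}}
plan = check_reachability(local_modules, modex)
print(plan)

try:
    check_reachability(local_modules, {3: {"fe80:c"}})   # peer on an unknown subnet
except RuntimeError as err:
    error_message = str(err)
    print(error_message)
```

Failing at the end of init, instead of on first contact, is what surfaces a
simple network or configuration error before the run has made any progress.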
>>> We have been discussing a "modex-less" startup for quite a while;
>>> this is actually one of the topics on the agenda for an engineering
>>> meeting that we're having in December. Modex-less is quite important
>>> for scalability to many thousands of processes, but other tradeoffs
>>> may be necessary to make this work (read: we've talked about
>>> modex-less for forever; we're finally likely to do it in the near
>>> future because of some upcoming very very large scale machines at US
>>> DOE labs).
>>> Does that make sense?
>> Leonardo Fialho
>> Computer Architecture and Operating Systems Department - CAOS
>> Universidad Autonoma de Barcelona - UAB
>> ETSE, Edifcio Q, QC/3088
>> Phone: +34-93-581-2888
>> Fax: +34-93-581-2478
>> devel mailing list