Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Modex and others
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-11-13 08:36:58

If you look at the Dec meeting wiki, you will see that we are moving
quickly to a modex-less launch anyway. It won't be the default because
it requires pre-discovery of the cluster's network resources (for
which we will provide a tool or method), but it will help resolve some
of these problems.

Outside of that, I will have to leave it to the FT folks to figure out
how to resolve modex situations. We have the ability to support
multiple modex models (and already do), but I don't know if you can do
what you describe or not - I'm not sure how the MPI layer will handle
that situation.


On Nov 13, 2008, at 6:22 AM, Leonardo Fialho wrote:

> Jeff,
> I agree with your viewpoint, principally about the "reachability".
> But...
> Looking from the FT viewpoint, some FT architectures sometimes
> want to recover an application process on a node different from
> the original one. In this case a new modex should be called. That's
> fine for coordinated C/R; on the other hand, for uncoordinated C/R
> it's not a good choice, I think. Once again, the tradeoffs...
> A possible solution is to perform n-1 modex exchanges involving the
> recovered process and each of the other processes... Is that better
> than an allgather modex? I don't know. I think not. And what is the
> impact of an allgather modex while the MPI thread is delivering
> messages? The answers to these questions could suggest that
> uncoordinated C/R is not possible on Open MPI.
> Leonardo Fialho
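
To make the n-1 versus allgather comparison above concrete, here is a toy Python sketch of the two update strategies after one process is recovered on a new node. Everything in it is an assumption made for illustration: the function names, the string "contact" values, the idea that the restarted process starts with an empty table, and the naive allgather cost model (every rank sends its own entry to every other rank). It is not how Open MPI implements the modex, and it counts only messages, not the synchronization cost of a collective.

```python
def initial_tables(contacts):
    """Post-modex state: every rank knows every rank's contact info."""
    return {rank: dict(contacts) for rank in contacts}


def pairwise_update(tables, recovered, new_contact):
    """n-1 two-way exchanges between the recovered rank and each peer.

    Only the recovered rank and one peer take part in each exchange.
    Returns the number of point-to-point messages (hypothetical model).
    """
    messages = 0
    # Assumption: the restarted process has to relearn its peers.
    tables[recovered] = {recovered: new_contact}
    for peer in tables:
        if peer == recovered:
            continue
        tables[peer][recovered] = new_contact         # recovered -> peer
        tables[recovered][peer] = tables[peer][peer]  # peer -> recovered
        messages += 2
    return messages


def allgather_update(tables, recovered, new_contact):
    """Naive allgather: all ranks take part and rebuild their whole table.

    Returns the number of entries moved, n * (n - 1) in this toy model.
    """
    own = {
        rank: new_contact if rank == recovered else tables[rank][rank]
        for rank in tables
    }
    messages = 0
    for rank in tables:
        tables[rank] = dict(own)
        messages += len(tables) - 1
    return messages
```

Under these assumptions both strategies leave every rank with the same table; the pairwise variant moves 2(n-1) messages and touches one peer at a time, while the allgather variant moves n(n-1) entries and requires every process to participate at once.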
> Jeff Squyres wrote:
>> On Nov 7, 2008, at 10:18 AM, Leonardo Fialho wrote:
>>> I understand that a process needs to have the contact information
>>> to send MPI messages to other processes, and the modex provides it.
>>> My question is, why not perform the contact exchange only when it
>>> is necessary?
>>> For example: in an M/W application, the workers do not need more
>>> information than the master's contact info.
>>> I think that it reduces the startup time, but increases the cost of
>>> the *first* communication between two peers.
>> FWIW, this is actually a pretty complex topic. There are many,
>> many tradeoffs in terms of what performance do you want vs. what
>> functionality do you want. This subject has been discussed for
>> many, many hours by the OMPI developers. :-)
>> The modex is performed during MPI_INIT; the v1.3 series' modex is
>> quite a bit more efficient than the v1.2 series' modex. The modex
>> information comprises several things, some of which are either
>> the contact info or "reachability" info of BTL modules. For the
>> openib BTL, for example, port subnet IDs and MTUs are passed in
>> the modex, but LIDs don't need to be passed (in some cases) until
>> two processes actually try to reach each other. We use the
>> reachability information to determine whether a given BTL module
>> *could* be used to connect to a remote peer. For example, if we
>> get to the end of MPI_INIT and find a peer that cannot be reached,
>> we abort (after hours of debate, we decided it was better to abort
>> right away when there was a peer that could not be reached rather
>> than abort only on attempted first contact because it could be a
>> simple network/configuration error that should be detected
>> immediately, rather than erroring out [potentially] long into a
>> multi-hour run).
>> We have been discussing a "modex-less" startup for quite a while;
>> this is actually one of the topics on the agenda for an engineering
>> meeting that we're having in December. Modex-less startup is quite
>> important for scalability to many thousands of processes, but other
>> tradeoffs may be necessary to make this work (read: we've talked
>> about modex-less forever; we're finally likely to do it in the near
>> future because of some upcoming very, very large-scale machines at
>> US DOE labs).
>> Does that make sense?
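
The tradeoff described above (check reachability for every peer at the end of MPI_INIT, versus exchange contact info only on first contact) can be sketched in Python. This is a toy model under stated assumptions, not Open MPI code: `eager_init`, `LazyPeerTable`, `UnreachablePeer`, and the representation of reachability as sets of subnet names are all hypothetical names invented for the example.

```python
class UnreachablePeer(Exception):
    """Raised when no shared network exists between us and a peer."""


def eager_init(my_networks, modex):
    """Eager model: inspect every peer's reachability info at startup.

    `modex` maps peer rank -> set of network names (hypothetical format).
    Aborts immediately if any peer shares no network with this process.
    """
    for peer, networks in modex.items():
        if not (my_networks & networks):
            raise UnreachablePeer(f"peer {peer} unreachable at init")
    return dict(modex)


class LazyPeerTable:
    """Lazy model: fetch a peer's info only on first contact."""

    def __init__(self, my_networks, lookup):
        self.my_networks = my_networks
        self.lookup = lookup  # hypothetical callable: peer rank -> networks
        self.cache = {}
        self.lookups = 0      # how many peers we actually had to resolve

    def contact(self, peer):
        if peer not in self.cache:
            self.lookups += 1
            networks = self.lookup(peer)
            if not (self.my_networks & networks):
                # The error surfaces only now, possibly hours into a run.
                raise UnreachablePeer(f"peer {peer} unreachable on first contact")
            self.cache[peer] = networks
        return self.cache[peer]
```

In a master/worker job, a worker using the lazy table would resolve only the master and pay that cost on its first message, but a misconfigured peer goes unnoticed until something tries to talk to it; the eager version pays for every peer up front and fails before the job starts.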
> --
> Leonardo Fialho
> Computer Architecture and Operating Systems Department - CAOS
> Universidad Autonoma de Barcelona - UAB
> ETSE, Edificio Q, QC/3088
> Phone: +34-93-581-2888
> Fax: +34-93-581-2478