Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] connect management for multirail (Open-)MX
From: George Bosilca (bosilca_at_[hidden])
Date: 2009-06-17 14:45:06


Yes, in Open MPI the connections are usually created on demand. As far
as I know there are a few devices that do not abide by this "law", but
MX is not one of them.
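
Roughly, the pattern is the one in the sketch below. This is only an
illustration of the lazy-connect idea (the structure name and helper
functions are made up, not the actual BTL code):

    #include <stddef.h>

    /* Hypothetical endpoint descriptor; the real BTL structures differ. */
    typedef struct {
        int connected;    /* 0 until the first send to this peer */
    } endpoint_t;

    /* Stubs standing in for the device-specific calls. */
    static int connect_endpoint(endpoint_t *ep) { (void)ep; return 0; }
    static int post_send(endpoint_t *ep, const void *buf, size_t len)
    { (void)ep; (void)buf; (void)len; return 0; }

    /* Connections are created on demand: the first send to a peer
     * triggers the connection handshake, later sends go straight to
     * the data path. */
    static int send_to_peer(endpoint_t *ep, const void *buf, size_t len)
    {
        if (!ep->connected) {
            if (connect_endpoint(ep) != 0)
                return -1;
            ep->connected = 1;
        }
        return post_send(ep, buf, len);
    }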

To be more precise about how the connections are established: if each
node has two rails and we're doing a ping-pong, the first message from
p0 to p1 will connect the first NIC, and the second message the second
NIC (here I assume that both networks are similar). Moreover, in MX the
connection is not symmetric, so your (1) and (2) might happen
simultaneously.
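
As a toy model of that scheduling (not Open MPI's actual code, just the
round-robin idea), something like this reproduces the connect order:

    #include <stdio.h>

    #define NUM_RAILS 2

    /* Hypothetical per-peer state: one "connected" flag per rail. */
    static int rail_connected[NUM_RAILS];
    static int next_rail;    /* round-robin cursor */

    /* Each new message is load-balanced onto the next rail, and a rail
     * is connected the first time a message is scheduled onto it: so
     * message 1 brings up NIC 0 and message 2 brings up NIC 1. */
    static void send_message(int msg_id)
    {
        int rail = next_rail;
        next_rail = (next_rail + 1) % NUM_RAILS;
        if (!rail_connected[rail]) {
            printf("message %d: connecting rail %d first\n", msg_id, rail);
            rail_connected[rail] = 1;
        }
        printf("message %d: sent on rail %d\n", msg_id, rail);
    }

    int main(void)
    {
        for (int i = 1; i <= 4; i++)
            send_message(i);
        return 0;
    }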

Does the code contain an MPI_Barrier? If so, this might be why you see
the sequence (1), (2), (3) and (4) ...
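
For instance, a much simplified ping-pong skeleton like the one below
(hypothetical code, not the real IMB source) already drives traffic in
both directions through the barrier, which may explain why the rail-0
connections (1) and (2) complete before the loop ever touches rail 1:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        char buf = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* The barrier may already establish rail 0 in both directions. */
        MPI_Barrier(MPI_COMM_WORLD);

        /* Ping-pong between ranks 0 and 1; later messages are
         * load-balanced onto rail 1, triggering (3) and (4). */
        for (int i = 0; i < 100; i++) {
            if (rank == 0) {
                MPI_Send(&buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        MPI_Finalize();
        return 0;
    }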

   george.

On Jun 17, 2009, at 12:13 , Brice Goglin wrote:

> Thanks for the answer. So if I understand correctly, the connection
> order is decided dynamically depending on when each peer has some
> messages to send and how the upper level load-balances them. There
> shouldn't be anything preventing (1) and (2) from happening at the
> same time then. And I wonder why I always see 1, 2, 3, 4 with MX
> (using IMB) and not with Open-MX...
>
> Brice
>
>
>
> George Bosilca wrote:
>> Brice,
>>
>> The connection mechanism in the MX BTL suffers from a big problem on
>> multi-rail (if all NICs are identical). If the rails are connected
>> using the same mapper, they will have identical IDs. Unfortunately,
>> these IDs are supposed to be unique in order to guarantee the
>> connection ordering (0 to 0, 1 to 1 and so on, based on the mapper's
>> MAC). However, the outcome I saw in the past in this case is not a
>> deadlock but a poor distribution of the data over the two NICs: one
>> will be overloaded while the other will not be used at all.
>>
>> There is no answer from a peer when we connect the MX BTLs. If the
>> steps are the ones you described in your email, then I guess both
>> peers try to connect to each other simultaneously. Now, when you
>> have multiple rails, we treat them at the upper level as independent
>> devices, and we will try to load-balance the messages over all of
>> them. Step (3) seems to indicate that another (MPI) message has
>> been sent, and because of the load-balancing scheme we try to connect
>> the second device (the rail, in this context). In MX this works
>> because we use the blocking function (mx_connect).
>>
>> george.
>>
>> On Jun 17, 2009, at 08:23 , Brice Goglin wrote:
>>
>>> Hello,
>>>
>>> I am debugging some sort of deadlock when doing multirail over
>>> Open-MX. What I am seeing with 2 processes and 2 boards per node
>>> with *MX* is:
>>> 1) process 0 rail 0 connects to process 1 rail 0
>>> 2) p1r0 connects back to p0r0
>>> 3) p0 rail 1 connects to p1 rail 1
>>> 4) p1r1 connects back to p0r1
>>> For some reason, with *Open-MX*, process 0 seems to start (3) before
>>> process 1 has finished (2). It probably causes a deadlock because p1
>>> is polling on rail 0 for (2), while (3) needs somebody to poll on
>>> rail 1 for the connect handshake.
>>>
>>> So, the question is: is there anything in OMPI (1.3) guaranteeing
>>> that the above 4 steps will occur in some specified order? If so,
>>> Open-MX is probably doing something wrong that breaks the order. If
>>> not, adding a progression thread to Open-MX might be the only
>>> solution...
>>>
>>> thanks,
>>> Brice
>>>