Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] connect management for multirail (Open-)MX
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2009-06-17 12:13:55


Thanks for the answer. So if I understand correctly, the connection
order is decided dynamically depending on when each peer has some
messages to send and how the upper level load-balances them. There
shouldn't be anything preventing (1) and (2) from happening at the same
time then. And I wonder why I always see 1,2,3,4 with MX (using IMB) and
not with Open-MX...
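
To make the suspected deadlock from the original message (quoted below) concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not how OMPI, MX or Open-MX are actually implemented: a connect on rail r completes only while the peer is polling rail r, a process blocked in a connect polls only that rail, and an idle process polls every rail. The function `run` and its `poll_all_rails` flag are hypothetical names introduced for illustration; `poll_all_rails=True` stands in for a progression thread.

```python
def run(pending, poll_all_rails, rails=(0, 1)):
    """Return True if every connect completes, False if the model deadlocks.

    pending maps a process id (0 or 1) to the ordered list of rails it still
    has to connect on. Assumption of this toy model: a connect on rail r
    completes only in a round where the peer polls rail r.
    """
    pending = {p: list(queue) for p, queue in pending.items()}
    while any(pending.values()):
        # Decide which rails each process polls in this round.
        polled = {}
        for p, queue in pending.items():
            if poll_all_rails or not queue:
                polled[p] = set(rails)   # progression thread, or idle process
            else:
                polled[p] = {queue[0]}   # blocked in connect: polls that rail only
        # A connect finishes when the peer is polling the same rail.
        done = [p for p, queue in pending.items()
                if queue and queue[0] in polled[1 - p]]
        if not done:
            return False                 # nobody can make progress: deadlock
        for p in done:
            pending[p].pop(0)
    return True


# p0 has already started (3) on rail 1 while p1 is still in (2) on rail 0.
skewed = {0: [1], 1: [0]}
print(run(skewed, poll_all_rails=False))                  # False
print(run(skewed, poll_all_rails=True))                   # True
# Both peers walk the rails in the same order.
print(run({0: [0, 1], 1: [0, 1]}, poll_all_rails=False))  # True
```

In this model the skewed case hangs unless something polls all rails, while the in-order case completes even with per-rail polling. That matches the 1,2,3,4 sequence observed with MX, but it does not explain why MX ends up in that order.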

Brice

George Bosilca wrote:
> Brice,
>
> The connection mechanism in the MX BTL suffers from a big problem on
> multi-rail setups (if all NICs are identical). If the rails are
> connected using the same mapper, they will have identical IDs.
> Unfortunately, these IDs are supposed to be unique in order to
> guarantee the connection ordering (0 to 0, 1 to 1 and so on, based on
> the mapper's MAC). However, the outcome I saw in the past in this case
> was not a deadlock but a poor distribution of the data over the two
> NICs: one is overloaded while the other is not used at all.
>
> There is no answer from a peer when we connect the MX BTLs. If the
> steps are the ones you described in your email, then I guess both
> peers try to connect to each other simultaneously. Now, when you
> have multiple rails, we treat them at the upper level as independent
> devices, and we try to load-balance the messages over all of
> them. Step (3) seems to indicate that another MPI message has been
> sent and that, because of the load-balancing scheme, we try to connect
> the second device (the second rail in this context). In MX this works
> because we use the blocking function (mx_connect).
>
> george.
>
> On Jun 17, 2009, at 08:23 , Brice Goglin wrote:
>
>> Hello,
>>
>> I am debugging some sort of deadlock when doing multirail over Open-MX.
>> What I am seeing with 2 processes and 2 boards per node with *MX* is:
>> 1) process 0 rail 0 connects to process 1 rail 0
>> 2) p1r0 connects back to p0r0
>> 3) p0 rail 1 connects to p1 rail 1
>> 4) p1r1 connects back to p0r1
>> For some reason, with *Open-MX*, process 0 seems to start (3) before
>> process 1 has finished (2). It probably causes a deadlock because p1 is
>> polling on rail 0 for (2), while (3) needs somebody to poll on rail 1
>> for the connect handshake.
>>
>> So, the question is: is there anything in OMPI (1.3) guaranteeing that
>> the above 4 steps will occur in some specified order? If so, Open-MX is
>> probably doing something wrong that breaks the order. If not, adding a
>> progression thread to Open-MX might be the only solution...
>>
>> thanks,
>> Brice
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel