The connection mechanism in the MX BTL has a serious problem with
multi-rail when all the NICs are identical. If the rails are connected
through the same mapper, they will all have the same ID. Unfortunately,
these IDs are supposed to be unique in order to guarantee the
connection ordering (0 to 0, 1 to 1 and so on, based on the mapper's
MAC). However, the outcome I have seen in the past in this case is not a
deadlock but a poor distribution of the data over the two NICs: one
is overloaded while the other is not used at all.
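To illustrate why duplicate IDs skew the distribution, here is a toy model, not the actual MX BTL code. It assumes (hypothetically) that each rail is ranked by how many rails have a strictly smaller ID and that a local rail connects to the first peer rail with the same rank. The names rank_by_id and peer_rail_usage are made up for this sketch, and both sides are assumed to have the same number of rails with the same ID pattern.

```python
def rank_by_id(nic_ids):
    """Rank of each rail: number of rails with a strictly smaller ID."""
    return [sum(other < nic_id for other in nic_ids) for nic_id in nic_ids]


def peer_rail_usage(local_ids, peer_ids):
    """Count how many local rails end up connected to each peer rail.

    Hypothetical matching rule: a local rail of rank k connects to the
    first peer rail that also has rank k.
    """
    peer_ranks = rank_by_id(peer_ids)
    usage = {idx: 0 for idx in range(len(peer_ids))}
    for rank in rank_by_id(local_ids):
        target = peer_ranks.index(rank)  # first peer rail with this rank
        usage[target] += 1
    return usage


# Unique IDs: rails pair up 0 to 0 and 1 to 1.
unique = peer_rail_usage([0x10, 0x20], [0x30, 0x40])

# Identical IDs on both rails: every rail gets rank 0, so both local
# rails pick peer rail 0 and peer rail 1 is never used.
duplicate = peer_rail_usage([0x10, 0x10], [0x30, 0x30])
```

Under these assumptions the unique case spreads the connections evenly, while the duplicate case sends everything to a single rail, which matches the overloaded/unused behaviour described above.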
There is no answer from a peer when we connect the MX BTLs. If the
steps are the ones you described in your email, then I guess both
peers try to connect to each other simultaneously. Now, when you
have multiple rails, we treat them at the upper level as independent
devices, and we try to load balance the messages over all of
them. Step (3) seems to indicate that another (MPI) message has
been sent, and because of the load balancing scheme we try to connect
the second device (a rail in this context). In MX this works because we
use the blocking function (mx_connect).
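The suspected interleaving can be sketched as a small wait-for model. This is only an illustration under assumptions, not OMPI or Open-MX code: it assumes a blocked process progresses only the rails listed for it, that an idle process progresses all rails, and that a connect completes only once the peer progresses the rail it targets. The function name deadlocked is hypothetical.

```python
def deadlocked(pending, polled):
    """Return the set of processes whose blocking connect never completes.

    pending: {process: (peer, rail)}, the connect each process is blocked in.
    polled:  {process: set of rails that process progresses while blocked}.
    """
    pending = dict(pending)
    progressed = True
    while progressed:
        progressed = False
        for proc, (peer, rail) in list(pending.items()):
            # The handshake is serviced if the peer is idle (assumed to
            # poll every rail) or if it polls this rail while blocked.
            if peer not in pending or rail in polled[peer]:
                del pending[proc]
                progressed = True
    return set(pending)


# p0 has started step (3) on rail 1 while p1 is still in step (2) on rail 0.
pending = {"p0": ("p1", 1), "p1": ("p0", 0)}

# Each process polls only the rail it is currently connecting on.
own_rail_only = deadlocked(pending, {"p0": {1}, "p1": {0}})

# Each process progresses both rails while blocked (for example through
# a progression thread).
all_rails = deadlocked(pending, {"p0": {0, 1}, "p1": {0, 1}})
```

In this model the first case leaves both processes stuck, while progressing all rails lets both connects finish.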
On Jun 17, 2009, at 08:23, Brice Goglin wrote:
> I am debugging some sort of deadlock when doing multirail over Open-MX.
> What I am seeing with 2 processes and 2 boards per node with *MX* is:
> 1) process 0 rail 0 connects to process 1 rail 0
> 2) p1r0 connects back to p0r0
> 3) p0 rail 1 connects to p1 rail 1
> 4) p1r1 connects back to p0r1
> For some reason, with *Open-MX*, process 0 seems to start (3) before
> process 1 has finished (2). It probably causes a deadlock because p1 is
> polling on rail 0 for (2), while (3) needs somebody to poll on rail 1
> for the connect handshake.
> So, the question is: is there anything in OMPI (1.3) guaranteeing that
> the above 4 steps will occur in some specified order? If so, Open-MX is
> probably doing something wrong that breaks the order. If not, adding a
> progression thread to Open-MX might be the only solution...