I am debugging some sort of deadlock when doing multirail over Open-MX.
What I am seeing with 2 processes and 2 boards per node with *MX* is:
1) process 0 rail 0 connects to process 1 rail 0
2) p1r0 connects back to p0r0
3) p0 rail 1 connects to p1 rail 1
4) p1r1 connects back to p0r1
For some reason, with *Open-MX*, process 0 seems to start (3) before
process 1 has finished (2). This probably causes a deadlock because p1
is polling on rail 0 for (2), while (3) needs somebody to poll on rail 1
for the connect handshake.
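
To make the suspected scenario concrete, here is a toy Python model. It
is not Open-MX or OMPI code: it assumes each process blocks on one rail
at a time, polls only that rail, and that a handshake step completes
only when both sides are on the same rail. The names `completes`,
`mx_order` and `openmx_p0` are made up for this illustration.

```python
def completes(p0_rails, p1_rails):
    """Return True if both processes finish, False if they get stuck.

    Each list gives, in order, the rail a process blocks on. Toy
    assumption: a step only completes when the peer is blocked on
    (and therefore polling) the same rail at the same time.
    """
    i0 = i1 = 0
    while i0 < len(p0_rails) and i1 < len(p1_rails):
        if p0_rails[i0] != p1_rails[i1]:
            # Each side polls a rail the other one is not servicing.
            return False
        i0 += 1
        i1 += 1
    # If one side still has steps left, it waits forever.
    return i0 == len(p0_rails) and i1 == len(p1_rails)


# Ordering seen with MX: steps (1), (2) on rail 0, then (3), (4) on rail 1.
mx_order = [0, 0, 1, 1]

# Suspected Open-MX ordering: p0 does not wait for (2) and moves to rail 1.
openmx_p0 = [0, 1, 1]

print("MX ordering completes:", completes(mx_order, mx_order))
print("Open-MX ordering completes:", completes(openmx_p0, mx_order))
```

In this model the second case stops at the second step: p0 is already
on rail 1 for (3) while p1 is still on rail 0 waiting for (2).
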
So, the question is: is there anything in OMPI (1.3) guaranteeing that
the above 4 steps will occur in some specified order? If so, Open-MX is
probably doing something wrong that breaks the order. If not, adding a
progression thread to Open-MX might be the only solution...