I am debugging some sort of deadlock when doing multirail over Open-MX.
What I am seeing with 2 processes and 2 boards per node with *MX* is:
1) process 0 rail 0 (p0r0) connects to process 1 rail 0 (p1r0)
2) p1r0 connects back to p0r0
3) p0r1 connects to p1r1
4) p1r1 connects back to p0r1
For some reason, with *Open-MX*, process 0 seems to start (3) before
process 1 has finished (2). This probably causes a deadlock: p1 is
polling only on rail 0 to complete (2), while (3) needs p1 to poll on
rail 1 to answer the connect handshake.
So, the question is: is there anything in OMPI (1.3) guaranteeing that
the above 4 steps will occur in some specified order? If so, Open-MX is
probably doing something wrong that breaks the order. If not, adding a
progression thread to Open-MX might be the only solution...
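
To make the suspected interleaving concrete, here is a toy model of it in
Python. It is only a sketch, not Open-MX or OMPI code: the names `run`,
`scripts` and `progress_all` are made up for illustration, and it assumes
that a connect handshake completes only once the peer polls the rail the
request arrived on. The "accept" step is also an assumption; it stands
for a process waiting for its peer's connect before moving on.

```python
def run(scripts, progress_all, max_ticks=50):
    """Toy model of two processes; return True if both finish their script.

    scripts[p] is a list of (op, rail) steps for process p:
      "connect": post a request to the peer on `rail`, block until answered
      "accept":  block until this process has answered a request on `rail`
    While blocked, a process polls only the rail of its current step,
    unless progress_all is True (all rails are polled all the time, which
    is roughly what a progression thread would provide).
    """
    pc = [0, 0]                 # index of the current step per process
    posted = [False, False]     # current connect request already sent?
    pending = [set(), set()]    # rails holding an unanswered request
    acked = [set(), set()]      # rails whose connect got answered
    accepted = [set(), set()]   # rails on which a request was answered

    for _ in range(max_ticks):
        progressed = False
        for p in (0, 1):
            peer = 1 - p
            op = scripts[p][pc[p]] if pc[p] < len(scripts[p]) else None

            # Send the connect request once.
            if op and op[0] == "connect" and not posted[p]:
                pending[peer].add(op[1])
                posted[p] = True
                progressed = True

            # Poll: decide which rails this process services right now.
            if progress_all:
                rails = set(pending[p])
            elif op:
                rails = pending[p] & {op[1]}
            else:
                rails = set()
            for r in rails:
                pending[p].discard(r)
                acked[peer].add(r)
                accepted[p].add(r)
                progressed = True

            # Complete the current step if its condition is now met.
            if op:
                kind, rail = op
                done = acked[p] if kind == "connect" else accepted[p]
                if rail in done:
                    done.discard(rail)
                    pc[p] += 1
                    posted[p] = False
                    progressed = True

        if all(pc[p] == len(scripts[p]) for p in (0, 1)):
            return True
        if not progressed:      # nobody can move any more: deadlock
            return False
    return False


# p1 always waits for p0's connect before connecting back.
P1 = [("accept", 0), ("connect", 0), ("accept", 1), ("connect", 1)]

# Order seen with MX: p0 waits for (2) before starting (3).
mx_order = [[("connect", 0), ("accept", 0), ("connect", 1), ("accept", 1)], P1]

# Order seen with Open-MX: p0 starts (3) right after (1).
openmx_order = [[("connect", 0), ("connect", 1)], P1]

print("MX order, poll current rail only:     ", run(mx_order, False))
print("Open-MX order, poll current rail only:", run(openmx_order, False))
print("Open-MX order, all rails progressed:  ", run(openmx_order, True))
```

In this model the MX order completes, the Open-MX order stalls with p0
polling rail 1 and p1 polling rail 0, and the same Open-MX order completes
once every rail is progressed regardless of what each process is blocked
on. That matches the reasoning above, but it says nothing about whether
OMPI actually enforces the order.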